
2025-05-14-12-04

Winning at All Cost: A Small Environment for Eliciting Specification Gaming Behaviors in Large Language Models

Abstract

arXiv:2505.07846v1 Announce Type: new Abstract: This study reveals how frontier Large Language Models (LLMs) can "game the system" when faced with impossible situations, a critical security and alignment concern. Using a novel textual simulation approach, we presented three leading LLMs (o1, o3-mini, and r1) with a tic-tac-toe scenario designed to be unwinnable through legitimate play, then analyzed their tendency to exploit loopholes rather than accept defeat. Our results are alarming for security researchers: the newer, reasoning-focused o3-mini model showed nearly twice the propensity to exploit system vulnerabilities (37.1%) compared to the older o1 model (17.5%). Most striking was the effect of prompting. Simply framing the task as requiring "creative" solutions caused gaming behaviors to skyrocket to 77.3% across all models. We identified four distinct exploitation strategies, from direct manipulation of game state to sophisticated modification of opponent behavior. These findings demonstrate that even without actual execution capabilities, LLMs can identify and propose sophisticated system exploits when incentivized, highlighting urgent challenges for AI alignment as models grow more capable of identifying and leveraging vulnerabilities in their operating environments.

摘要

本研究揭示了前沿大语言模型(LLMs)在面临不可能情境时如何"钻系统空子",这一发现对安全性和对齐性具有重要警示意义。通过创新的文本模拟方法,我们让三个领先的LLM模型(o1、o3-mini和r1)面对一个通过合法玩法无法获胜的井字棋场景,进而分析它们倾向于利用漏洞而非认输的行为。研究结果对安全研究人员发出警报:较新型、注重推理的o3-mini模型表现出近两倍于旧版o1模型(17.5%)的系统漏洞利用倾向(37.1%)。最显著的是提示词的影响——仅需将任务描述为需要"创造性"解决方案,所有模型的钻空行为就激增至77.3%。我们识别出四种不同的利用策略,从直接操纵游戏状态到复杂修改对手行为。这些发现表明,即使没有实际执行能力,当存在激励时,LLMs仍能识别并提出复杂的系统利用方案,这突显了随着模型识别和利用运行环境漏洞能力的提升,AI对齐问题面临的紧迫挑战。


Lost in Transmission: When and Why LLMs Fail to Reason Globally

Abstract

arXiv:2505.08140v1 Announce Type: new Abstract: Despite their many successes, transformer-based large language models (LLMs) continue to struggle with tasks that require complex reasoning over large parts of their input. We argue that these failures arise due to capacity limits on the accurate flow of information within LLMs. To formalize this issue, we introduce the bounded attention prefix oracle (BAPO) model, a new computational framework that models bandwidth constraints on attention heads, the mechanism for internal communication in LLMs. We show that several important reasoning problems like graph reachability require high communication bandwidth for BAPOs to solve; we call these problems BAPO-hard. Our experiments corroborate our theoretical predictions: GPT-4, Claude, and Gemini succeed on BAPO-easy tasks and fail even on relatively small BAPO-hard tasks. BAPOs also reveal another benefit of chain of thought (CoT): we prove that breaking down a task using CoT can turn any BAPO-hard problem into a BAPO-easy one. Our results offer principled explanations for key LLM failures and suggest directions for architectures and inference methods that mitigate bandwidth limits.

摘要

尽管取得了诸多成功,基于Transformer架构的大语言模型(LLM)在处理需要对其输入内容进行复杂推理的任务时仍存在困难。我们认为这些失败源于LLM内部信息流动准确性的容量限制。为系统阐述该问题,我们提出有界注意力前缀预言机(BAPO)模型——一种模拟注意力头部带宽约束的新计算框架(注意力机制是LLM内部通信的核心组件)。我们证明若干重要推理问题(如图可达性)需要BAPO具备高通信带宽才能解决,这类问题被定义为BAPO难题。实验验证了理论预测:GPT-4、Claude和Gemini能完成BAPO简易任务,但在相对小规模的BAPO难题上也会失败。BAPO还揭示了思维链(CoT)的另一优势:我们证明通过CoT分解任务可将任何BAPO难题转化为BAPO易解问题。这些发现为LLM关键失效模式提供了原理性解释,并为突破带宽限制的架构设计和推理方法指明了方向。


Patchwork: A Unified Framework for RAG Serving

Abstract

arXiv:2505.07833v1 Announce Type: new Abstract: Retrieval Augmented Generation (RAG) has emerged as a new paradigm for enhancing Large Language Model reliability through integration with external knowledge sources. However, efficient deployment of these systems presents significant technical challenges due to their inherently heterogeneous computational pipelines comprising LLMs, databases, and specialized processing components. We introduce Patchwork, a comprehensive end-to-end RAG serving framework designed to address these efficiency bottlenecks. Patchwork's architecture offers three key innovations: First, it provides a flexible specification interface enabling users to implement custom RAG pipelines. Secondly, it deploys these pipelines as distributed inference systems while optimizing for the unique scalability characteristics of individual RAG components. Third, Patchwork incorporates an online scheduling mechanism that continuously monitors request load and execution progress, dynamically minimizing SLO violations through strategic request prioritization and resource auto-scaling. Our experimental evaluation across four distinct RAG implementations demonstrates that Patchwork delivers substantial performance improvements over commercial alternatives, achieving throughput gains exceeding 48% while simultaneously reducing SLO violations by ~24%.

摘要

检索增强生成(RAG)作为一种通过整合外部知识源来提升大语言模型可靠性的新范式已经兴起。然而,由于这类系统本质上由大语言模型、数据库和专用处理组件构成的异构计算管道,其高效部署面临着重大技术挑战。我们提出了Patchwork——一个旨在解决这些效率瓶颈的端到端RAG服务框架。该架构具有三项关键创新:首先,它提供了灵活的规范接口,使用户能够实现自定义RAG流程;其次,它将流程部署为分布式推理系统,同时针对各RAG组件独特的可扩展性特征进行优化;第三,Patchwork整合了在线调度机制,持续监控请求负载与执行进度,通过策略性请求优先级调度和资源自动扩缩容,动态减少服务等级目标(SLO)违约。在四种不同RAG实现上的实验评估表明,Patchwork相较商业替代方案展现出显著性能提升,吞吐量增益超过48%,同时将SLO违约率降低约24%。


Benchmarking AI scientists in omics data-driven biological research

Abstract

arXiv:2505.08341v1 Announce Type: new Abstract: The rise of large language models and multi-agent systems has sparked growing interest in AI scientists capable of autonomous biological research. However, existing benchmarks either focus on reasoning without data or on data analysis with predefined statistical answers, lacking realistic, data-driven evaluation settings. Here, we introduce the Biological AI Scientist Benchmark (BaisBench), a benchmark designed to assess AI scientists' ability to generate biological discoveries through data analysis and reasoning with external knowledge. BaisBench comprises two tasks: cell type annotation on 31 expert-labeled single-cell datasets, and scientific discovery through answering 198 multiple-choice questions derived from the biological insights of 41 recent single-cell studies. Systematic experiments on state-of-the-art AI scientists and LLM agents showed that while promising, current models still substantially underperform human experts on both tasks. We hope BaisBench will fill this gap and serve as a foundation for advancing and evaluating AI models for scientific discovery. The benchmark can be found at: https://github.com/EperLuo/BaisBench.

摘要

大型语言模型与多智能体系统的兴起,引发了人们对能够自主开展生物学研究的人工智能科学家的日益关注。然而现有基准测试要么聚焦于无数据支持的推理任务,要么局限于提供预设统计答案的数据分析,缺乏真实数据驱动的评估场景。为此,我们提出生物AI科学家基准(BaisBench),该基准旨在评估AI科学家通过数据分析与外部知识推理生成生物学发现的能力。BaisBench包含两项任务:基于31个专家标注单细胞数据集的细胞类型注释,以及通过回答198道源自41项最新单细胞研究生物学见解的多选题来实现科学发现。针对前沿AI科学家与LLM智能体的系统实验表明,尽管当前模型展现出潜力,但在两项任务上的表现仍显著低于人类专家水平。我们希望BaisBench能够填补这一空白,并为科学发现AI模型的推进与评估奠定基础。基准测试地址:https://github.com/EperLuo/BaisBench。


Decoding Neighborhood Environments with Large Language Models

Abstract

arXiv:2505.08163v1 Announce Type: new Abstract: Neighborhood environments include physical and environmental conditions such as housing quality, roads, and sidewalks, which significantly influence human health and well-being. Traditional methods for assessing these environments, including field surveys and geographic information systems (GIS), are resource-intensive, which makes it challenging to evaluate neighborhood environments at scale. Although machine learning offers potential for automated analysis, the laborious process of labeling training data and the lack of accessible models hinder scalability. This study explores the feasibility of large language models (LLMs) such as ChatGPT and Gemini as tools for decoding neighborhood environments (e.g., sidewalk and powerline) at scale. We train a robust YOLOv11-based model, which achieves an average accuracy of 99.13% in detecting six environmental indicators, including streetlight, sidewalk, powerline, apartment, single-lane road, and multilane road. We then evaluate four LLMs, including ChatGPT, Gemini, Claude, and Grok, to assess their feasibility, robustness, and limitations in identifying these indicators, with a focus on the impact of prompting strategies and fine-tuning. We apply majority voting with the top three LLMs to achieve over 88% accuracy, which demonstrates that LLMs could be a useful tool to decode the neighborhood environment without any training effort.

摘要

邻里环境包含住房质量、道路及人行道等物理与环境条件,这些因素对人类健康与福祉具有显著影响。传统评估方法(如实地调查和地理信息系统)需要大量资源,难以实现大规模环境评估。尽管机器学习为自动化分析提供了可能,但训练数据标注的繁琐过程及可用模型的缺乏阻碍了其扩展应用。本研究探讨了ChatGPT、Gemini等大语言模型作为大规模解码邻里环境(如人行道与电力线)工具的可行性。我们训练了基于YOLOv11的鲁棒模型,在检测街灯、人行道、电力线、公寓楼、单车道及多车道道路六类环境指标时达到99.13%的平均准确率。随后评估了ChatGPT、Gemini、Claude和Grok四款大语言模型在识别这些指标时的可行性、鲁棒性及局限性,重点分析了提示策略与微调的影响。通过采用前三名大语言模型的多数投票法,实现了超过88%的准确率,证明大语言模型无需训练即可成为解码邻里环境的有效工具。
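
The majority-voting step over three LLMs mentioned in this abstract can be illustrated with a short sketch. It is not the authors' code: the model names, the indicator keys, and the yes/no answers below are hypothetical placeholders, and the sketch assumes each model returns one boolean per indicator for an image.

```python
from collections import Counter


def majority_vote(votes):
    """Return the label that most voters chose."""
    return Counter(votes).most_common(1)[0][0]


def ensemble_predictions(per_model):
    """Combine {model_name: {indicator: bool}} into one {indicator: bool}."""
    indicators = next(iter(per_model.values())).keys()
    return {
        indicator: majority_vote([preds[indicator] for preds in per_model.values()])
        for indicator in indicators
    }


# Hypothetical answers from three models for a single street-view image.
example = {
    "model_a": {"sidewalk": True, "powerline": False, "streetlight": True},
    "model_b": {"sidewalk": True, "powerline": True, "streetlight": False},
    "model_c": {"sidewalk": False, "powerline": False, "streetlight": True},
}
combined = ensemble_predictions(example)
```

With an odd number of voters and binary labels a tie cannot occur, which is one practical reason to vote with three models rather than two or four.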


Resource-Efficient Language Models: Quantization for Fast and Accessible Inference

Abstract

arXiv:2505.08620v1 Announce Type: new Abstract: Large language models have significantly advanced natural language processing, yet their heavy resource demands pose severe challenges regarding hardware accessibility and energy consumption. This paper presents a focused and high-level review of post-training quantization (PTQ) techniques designed to optimize the inference efficiency of LLMs by the end-user, including details on various quantization schemes, granularities, and trade-offs. The aim is to provide a balanced overview between the theory and applications of post-training quantization.

摘要

大型语言模型在自然语言处理领域取得了显著进展,但其高昂的资源需求对硬件可及性和能源消耗提出了严峻挑战。本文针对终端用户优化大模型推理效率的训练后量化技术,从量化方案、粒度选择与性能权衡等维度展开系统性综述。研究旨在为训练后量化的理论框架与实际应用提供平衡的学术视角。
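
As a concrete instance of the kind of scheme such a review covers, the sketch below shows symmetric absmax quantization to 8-bit integers at per-tensor granularity. It is an illustrative assumption rather than a method taken from the paper; the function names and the toy weight list are invented for the example.

```python
def quantize_absmax_int8(weights):
    """Symmetric per-tensor quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    # One scale for the whole tensor; per-channel schemes would keep one per row.
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float values from the integer codes."""
    return [q * scale for q in quantized]


weights = [0.5, -1.27, 0.0, 1.0]  # toy values standing in for a weight tensor
q, scale = quantize_absmax_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
```

The rounding error of each weight is bounded by half the scale, so a single large outlier inflates the error for every other value in the tensor. That trade-off is what motivates finer granularities such as per-channel or per-group scales.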


Learning Like Humans: Advancing LLM Reasoning Capabilities via Adaptive Difficulty Curriculum Learning and Expert-Guided Self-Reformulation

Abstract

arXiv:2505.08364v1 Announce Type: new Abstract: Despite impressive progress in areas like mathematical reasoning, large language models still face significant challenges in consistently solving complex problems. Drawing inspiration from key human learning strategies, we propose two novel strategies to enhance the capability of large language models to solve these complex problems. First, Adaptive Difficulty Curriculum Learning (ADCL) is a novel curriculum learning strategy that tackles the Difficulty Shift phenomenon (i.e., a model's perception of problem difficulty dynamically changes during training) by periodically re-estimating difficulty within upcoming data batches to maintain alignment with the model's evolving capabilities. Second, Expert-Guided Self-Reformulation (EGSR) is a novel reinforcement learning strategy that bridges the gap between imitation learning and pure exploration by guiding models to reformulate expert solutions within their own conceptual framework, rather than relying on direct imitation, fostering deeper understanding and knowledge assimilation. Extensive experiments on challenging mathematical reasoning benchmarks, using Qwen2.5-7B as the base model, demonstrate that these human-inspired strategies synergistically and significantly enhance performance. Notably, their combined application improves performance over the standard Zero-RL baseline by 10% on the AIME24 benchmark and 16.6% on AIME25.

摘要

尽管大语言模型在数学推理等领域取得了显著进展,但在持续解决复杂问题方面仍面临重大挑战。受人类关键学习策略的启发,我们提出两种创新策略来增强大语言模型解决复杂问题的能力。首先,自适应难度课程学习(ADCL)是一种新型课程学习策略,通过定期重新评估后续数据批次的难度以保持与模型动态演进能力的匹配,从而解决"难度迁移"现象(即模型对问题难度的感知在训练过程中动态变化)。其次,专家引导自我重构(EGSR)是一种新型强化学习策略,通过引导模型在其自身概念框架内重构专家解决方案(而非依赖直接模仿),在模仿学习与纯粹探索之间建立桥梁,促进更深层次的理解和知识内化。基于Qwen2.5-7B基础模型在复杂数学推理基准上的大量实验表明,这些受人类启发的策略能产生显著的协同增效作用。特别值得注意的是,在AIME24基准测试中组合应用这些策略比标准Zero-RL基线性能提升10%,在AIME25基准测试中提升16.6%。
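
The idea of re-estimating difficulty as the model changes can be sketched as below. This is a toy reading of the abstract, not the ADCL algorithm itself: `adaptive_curriculum`, the per-topic `skill` table, and the rule that training raises skill by the problem level are all assumptions made for illustration.

```python
def adaptive_curriculum(problems, estimate_difficulty, train_on_batch, batch_size):
    """Build batches easiest-first, re-scoring the remaining pool before each batch."""
    remaining = list(problems)
    schedule = []
    while remaining:
        # Difficulty is re-estimated with the model's current state, not fixed up front.
        remaining.sort(key=estimate_difficulty)
        batch, remaining = remaining[:batch_size], remaining[batch_size:]
        train_on_batch(batch)  # training may change later difficulty estimates
        schedule.append(batch)
    return schedule


# Toy stand-in for a model: one skill number per topic.
skill = {"algebra": 0, "geometry": 0}


def toy_difficulty(problem):
    topic, level = problem
    return level - skill[topic]


def toy_train(batch):
    for topic, level in batch:
        skill[topic] += level


problems = [("algebra", 1), ("algebra", 2), ("geometry", 3), ("geometry", 4), ("algebra", 6)]
schedule = adaptive_curriculum(problems, toy_difficulty, toy_train, batch_size=2)
```

In this toy run the level-6 algebra problem is scheduled before the level-4 geometry problem, because practice on algebra has lowered its perceived difficulty. A static ordering by level would have placed it last.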


Scalable UAV Multi-Hop Networking via Multi-Agent Reinforcement Learning with Large Language Models

Abstract

arXiv:2505.08448v1 Announce Type: new Abstract: In disaster scenarios, establishing robust emergency communication networks is critical, and unmanned aerial vehicles (UAVs) offer a promising solution to rapidly restore connectivity. However, organizing UAVs to form multi-hop networks in large-scale dynamic environments presents significant challenges, including limitations in algorithmic scalability and the vast exploration space required for coordinated decision-making. To address these issues, we propose MRLMN, a novel framework that integrates multi-agent reinforcement learning (MARL) and large language models (LLMs) to jointly optimize UAV agents toward achieving optimal networking performance. The framework incorporates a grouping strategy with reward decomposition to enhance algorithmic scalability and balance decision-making across UAVs. In addition, behavioral constraints are applied to selected key UAVs to improve the robustness of the network. Furthermore, the framework integrates LLM agents, leveraging knowledge distillation to transfer their high-level decision-making capabilities to MARL agents. This enhances both the efficiency of exploration and the overall training process. In the distillation module, a Hungarian algorithm-based matching scheme is applied to align the decision outputs of the LLM and MARL agents and define the distillation loss. Extensive simulation results validate the effectiveness of our approach, demonstrating significant improvements in network performance, including enhanced coverage and communication quality.

摘要

在灾难场景中,建立稳健的应急通信网络至关重要,而无人机为快速恢复连接提供了可行方案。然而在大规模动态环境中组织无人机形成多跳网络面临重大挑战,包括算法可扩展性的限制以及协同决策所需的大规模探索空间。为解决这些问题,我们提出MRLMN框架,该框架整合多智能体强化学习与大型语言模型,通过联合优化无人机智能体以实现最优组网性能。该框架采用分组策略与奖励分解机制以增强算法可扩展性并平衡无人机间的决策制定。此外,通过对关键无人机施加行为约束来提高网络鲁棒性。进一步地,框架集成大型语言模型智能体,利用知识蒸馏技术将其高层决策能力迁移至多智能体强化学习智能体,从而提升探索效率并优化整体训练过程。在蒸馏模块中,采用基于匈牙利算法的匹配方案来对齐语言模型与强化学习智能体的决策输出,并据此定义蒸馏损失。大量仿真结果验证了本方法的有效性,在网络覆盖范围和通信质量等性能指标上均展现出显著提升。
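
The matching step solves a minimum-cost assignment problem. The sketch below finds the same optimum by exhaustive search over permutations, which is only practical for very small inputs; the Hungarian algorithm reaches that optimum in polynomial time. The cost matrix is hypothetical: rows are imagined as LLM decisions, columns as MARL decisions, and each entry as a mismatch cost between the two.

```python
from itertools import permutations


def min_cost_matching(cost):
    """Return (assignment, total) where assignment[i] is the column paired with row i."""
    n = len(cost)
    best_perm, best_total = None, float("inf")
    for perm in permutations(range(n)):
        total = sum(cost[i][perm[i]] for i in range(n))
        if total < best_total:
            best_perm, best_total = perm, total
    return list(best_perm), best_total


# Hypothetical mismatch costs between three LLM outputs and three MARL outputs.
cost = [
    [4, 1, 3],
    [2, 0, 5],
    [3, 2, 2],
]
assignment, total = min_cost_matching(cost)
```

Once the pairing is fixed, a distillation loss can be defined over the matched pairs. How the paper defines that loss is not stated in the abstract, so it is left out here.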


Evaluating LLM Metrics Through Real-World Capabilities

Abstract

arXiv:2505.08253v1 Announce Type: new Abstract: As generative AI becomes increasingly embedded in everyday workflows, it is important to evaluate its performance in ways that reflect real-world usage rather than abstract notions of intelligence. Unlike many existing benchmarks that assess general intelligence, our approach focuses on real-world utility, evaluating how well models support users in everyday tasks. While current benchmarks emphasize code generation or factual recall, users rely on AI for a much broader range of activities-from writing assistance and summarization to citation formatting and stylistic feedback. In this paper, we analyze large-scale survey data and usage logs to identify six core capabilities that represent how people commonly use Large Language Models (LLMs): Summarization, Technical Assistance, Reviewing Work, Data Structuring, Generation, and Information Retrieval. We then assess the extent to which existing benchmarks cover these capabilities, revealing significant gaps in coverage, efficiency measurement, and interpretability. Drawing on this analysis, we use human-centered criteria to identify gaps in how well current benchmarks reflect common usage that is grounded in five practical criteria: coherence, accuracy, clarity, relevance, and efficiency. For four of the six capabilities, we identify the benchmarks that best align with real-world tasks and use them to compare leading models. We find that Google Gemini outperforms other models-including OpenAI's GPT, xAI's Grok, Meta's LLaMA, Anthropic's Claude, DeepSeek, and Qwen from Alibaba-on these utility-focused metrics.

摘要

随着生成式人工智能日益融入日常工作流程,评估其性能的方式应反映真实使用场景而非抽象的智能概念。与众多评估通用智能的现有基准不同,我们的方法聚焦现实效用,评估模型在日常任务中对用户的支持程度。当前基准多关注代码生成或事实记忆,而用户依赖AI完成更广泛的活动——从写作辅助、摘要生成到文献格式化和文体反馈。本文通过分析大规模调查数据和使用日志,确定了人们使用大语言模型(LLMs)的六大核心能力:摘要生成、技术协助、工作审查、数据结构化、内容生成和信息检索。随后我们评估现有基准对这些能力的覆盖程度,发现其在覆盖范围、效率测量和可解释性方面存在显著缺陷。基于此分析,我们采用以人为中心的标准,依据五项实践准则(连贯性、准确性、清晰度、相关性和效率)揭示了当前基准在反映常见使用场景方面的不足。针对六大能力中的四项,我们筛选出最贴合实际任务的基准测试,并借此比较主流模型性能。研究发现,在这些实用导向的指标上,谷歌Gemini的表现优于其他模型——包括OpenAI的GPT、xAI的Grok、Meta的LLaMA、Anthropic的Claude、深度求索(DeepSeek)以及阿里巴巴的Qwen。


Guiding LLM-based Smart Contract Generation with Finite State Machine

Abstract

arXiv:2505.08542v1 Announce Type: new Abstract: Smart contracts are self-executing code based on blockchain technology with a wide range of application scenarios, but the traditional way of producing them relies on manual coding and expert auditing, which imposes a high barrier to entry and is inefficient. Although Large Language Models (LLMs) show great potential in programming tasks, they still face challenges in smart contract generation w.r.t. effectiveness and security. To solve these problems, we propose FSM-SCG, a smart contract generation framework based on finite state machines (FSMs) and LLMs, which significantly improves the quality of the generated code by abstracting user requirements into an FSM, guiding LLMs to generate smart contracts, and iteratively optimizing the code with the feedback of compilation and security checks. The experimental results show that FSM-SCG significantly improves the quality of smart contract generation. Compared to the best baseline, FSM-SCG improves the compilation success rate of generated smart contract code by up to 48%, and reduces the average vulnerability risk score by approximately 68%.

摘要

智能合约是一种基于区块链技术的自执行代码,具有广泛的应用场景,但传统生成方法依赖人工编码和专家审核,存在门槛高、效率低的问题。尽管大语言模型(LLMs)在编程任务中展现出巨大潜力,但在智能合约生成的有效性和安全性方面仍面临挑战。为解决这些问题,我们提出FSM-SCG——一种基于有限状态机(FSM)和LLMs的智能合约生成框架。该框架通过将用户需求抽象为FSM来指导LLMs生成智能合约,并利用编译反馈和安全检查迭代优化代码,显著提升了生成代码的质量。实验结果表明,FSM-SCG显著改善了智能合约生成质量:相较于最佳基线方法,其生成的智能合约代码编译成功率最高提升48%,平均漏洞风险评分降低约68%。
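
To make the notion of an FSM abstraction concrete, the sketch below encodes a small escrow workflow as explicit states and transitions. It is a hypothetical example: the `ContractFSM` class, the state names, and the actions are invented here and are not taken from FSM-SCG.

```python
class ContractFSM:
    """A minimal finite state machine: a current state plus a transition table."""

    def __init__(self, initial, transitions):
        self.state = initial
        self.transitions = transitions  # {(state, action): next_state}

    def allowed_actions(self):
        """List the actions that are legal in the current state."""
        return sorted(a for (s, a) in self.transitions if s == self.state)

    def fire(self, action):
        """Apply an action, rejecting any transition the table does not define."""
        key = (self.state, action)
        if key not in self.transitions:
            raise ValueError(f"action {action!r} not allowed in state {self.state!r}")
        self.state = self.transitions[key]
        return self.state


# Hypothetical escrow requirements expressed as states and transitions.
escrow = ContractFSM(
    "Created",
    {
        ("Created", "deposit"): "Funded",
        ("Funded", "release"): "Released",
        ("Funded", "refund"): "Refunded",
    },
)
```

An explicit table like this gives a generator a checkable specification: every function in the produced contract should correspond to a listed transition, and anything else can be flagged before compilation or security checks run.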


Strategy-Augmented Planning for Large Language Models via Opponent Exploitation

Abstract

arXiv:2505.08459v1 Announce Type: new Abstract: Efficiently modeling and exploiting opponents is a long-standing challenge in adversarial domains. Large Language Models (LLMs) trained on extensive textual data have recently demonstrated outstanding performance in general tasks, introducing new research directions for opponent modeling. Some studies focus on directly using LLMs to generate decisions from an elaborate prompt context that incorporates opponent descriptions, but these approaches are limited to scenarios where LLMs possess adequate domain expertise. To address this limitation, we introduce a two-stage Strategy-Augmented Planning (SAP) framework that significantly enhances the opponent exploitation capabilities of LLM-based agents by utilizing a critical component, the Strategy Evaluation Network (SEN). Specifically, in the offline stage, we construct an explicit strategy space and subsequently collect strategy-outcome pair data for training the SEN. During the online phase, SAP dynamically recognizes the opponent's strategies and greedily exploits them by searching for the best-response strategy with the trained SEN, and finally translates that strategy into a course of action through carefully designed prompts. Experimental results show that SAP exhibits robust generalization capabilities, allowing it to perform effectively not only against previously encountered opponent strategies but also against novel, unseen strategies. In the MicroRTS environment, SAP achieves an 85.35% performance improvement over baseline methods and matches the competitiveness of reinforcement learning approaches against state-of-the-art (SOTA) rule-based AI.

摘要

高效建模并利用对手行为是对抗性领域长期存在的挑战。近期,基于海量文本数据训练的大语言模型(LLM)在通用任务中展现出卓越性能,为对手建模研究开辟了新方向。现有研究主要集中于通过精心设计的提示上下文(包含对手描述)直接利用LLM生成决策,但这类方法仅适用于LLM具备充分领域知识的场景。为此,我们提出两阶段策略增强规划(SAP)框架,通过关键组件策略评估网络(SEN)显著提升基于LLM的智能体剥削对手能力。具体而言,在离线阶段构建显式策略空间并收集策略-结果配对数据训练SEN网络;在线阶段动态识别对手策略,通过训练完备的SEN贪婪搜索最优响应策略,最终借助精心设计的提示将策略转化为行动序列。实验结果表明,SAP具有强大泛化能力,不仅能有效应对已知对手策略,对未见新策略同样表现优异。在MicroRTS环境中,SAP相较基线方法实现85.35%的性能提升,并与基于规则的先进AI(SOTA)对抗时达到与强化学习方法相当的竞争力。


TrialMatchAI: An End-to-End AI-powered Clinical Trial Recommendation System to Streamline Patient-to-Trial Matching

Abstract

arXiv:2505.08508v1 Announce Type: new Abstract: Patient recruitment remains a major bottleneck in clinical trials, calling for scalable and automated solutions. We present TrialMatchAI, an AI-powered recommendation system that automates patient-to-trial matching by processing heterogeneous clinical data, including structured records and unstructured physician notes. Built on fine-tuned, open-source large language models (LLMs) within a retrieval-augmented generation framework, TrialMatchAI ensures transparency and reproducibility and maintains a lightweight deployment footprint suitable for clinical environments. The system normalizes biomedical entities, retrieves relevant trials using a hybrid search strategy combining lexical and semantic similarity, re-ranks results, and performs criterion-level eligibility assessments using medical Chain-of-Thought reasoning. This pipeline delivers explainable outputs with traceable decision rationales. In real-world validation, 92 percent of oncology patients had at least one relevant trial retrieved within the top 20 recommendations. Evaluation across synthetic and real clinical datasets confirmed state-of-the-art performance, with expert assessment validating over 90 percent accuracy in criterion-level eligibility classification, particularly excelling in biomarker-driven matches. Designed for modularity and privacy, TrialMatchAI supports Phenopackets-standardized data, enables secure local deployment, and allows seamless replacement of LLM components as more advanced models emerge. By enhancing efficiency and interpretability and offering lightweight, open-source deployment, TrialMatchAI provides a scalable solution for AI-driven clinical trial matching in precision medicine.

摘要

患者招募仍是临床试验中的主要瓶颈,亟需可扩展的自动化解决方案。我们推出TrialMatchAI——一个基于人工智能的推荐系统,通过处理结构化病历和非结构化医师笔记等异构临床数据,实现患者-试验匹配的自动化。该系统基于检索增强生成框架下的微调开源大语言模型(LLMs),在保证透明度和可重复性的同时,保持了适合临床环境的轻量级部署特性。该系统可标准化生物医学实体,通过结合词法和语义相似度的混合搜索策略检索相关试验,对结果进行重排序,并采用医学思维链推理进行标准级别的资格评估。该流程能提供具有可追溯决策依据的可解释输出。在实际验证中,92%的肿瘤患者在推荐前20项结果中至少匹配到一项相关试验。在合成和真实临床数据集上的评估证实了其领先性能,专家评估显示其在标准级别资格分类中的准确率超过90%,尤其在生物标志物驱动的匹配方面表现优异。TrialMatchAI采用模块化隐私设计,支持Phenopackets标准化数据,可实现安全的本地部署,并能随着更先进模型的出现无缝替换LLM组件。通过提升效率与可解释性,并提供轻量级开源部署方案,TrialMatchAI为精准医学领域的人工智能驱动临床试验匹配提供了可扩展的解决方案。
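
A hybrid search that blends lexical and semantic similarity can be sketched as follows. This is a simplified stand-in, not TrialMatchAI's retrieval code: it uses Jaccard token overlap for the lexical part and cosine similarity over made-up two-dimensional vectors for the semantic part, and the trial identifiers, texts, and the weight `alpha` are all hypothetical.

```python
import math


def lexical_score(query, doc):
    """Jaccard overlap between lowercase token sets."""
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / len(q | d) if q | d else 0.0


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def hybrid_rank(query, query_vec, docs, alpha=0.5):
    """Rank (doc_id, text, embedding) triples by a weighted blend of both scores."""
    scored = [
        (alpha * lexical_score(query, text) + (1 - alpha) * cosine(query_vec, vec), doc_id)
        for doc_id, text, vec in docs
    ]
    return [doc_id for _, doc_id in sorted(scored, reverse=True)]


# Hypothetical trials with invented embeddings.
docs = [
    ("trial_a", "EGFR inhibitor in lung cancer", [0.9, 0.1]),
    ("trial_b", "statin therapy for cholesterol", [0.0, 1.0]),
    ("trial_c", "non-small cell carcinoma targeted therapy", [0.8, 0.2]),
]
ranking = hybrid_rank("EGFR lung cancer", [1.0, 0.0], docs)
```

Here `trial_c` shares no words with the query yet still outranks `trial_b`, because the semantic term picks up what the lexical term misses. That complementarity is the usual argument for combining the two signals before a re-ranking stage.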


Visually Guided Decoding: Gradient-Free Hard Prompt Inversion with Language Models

Abstract

arXiv:2505.08622v1 Announce Type: new Abstract: Text-to-image generative models like DALL-E and Stable Diffusion have revolutionized visual content creation across various applications, including advertising, personalized media, and design prototyping. However, crafting effective textual prompts to guide these models remains challenging, often requiring extensive trial and error. Existing prompt inversion approaches, such as soft and hard prompt techniques, are not so effective due to the limited interpretability and incoherent prompt generation. To address these issues, we propose Visually Guided Decoding (VGD), a gradient-free approach that leverages large language models (LLMs) and CLIP-based guidance to generate coherent and semantically aligned prompts. In essence, VGD utilizes the robust text generation capabilities of LLMs to produce human-readable prompts. Further, by employing CLIP scores to ensure alignment with user-specified visual concepts, VGD enhances the interpretability, generalization, and flexibility of prompt generation without the need for additional training. Our experiments demonstrate that VGD outperforms existing prompt inversion techniques in generating understandable and contextually relevant prompts, facilitating more intuitive and controllable interactions with text-to-image models.

摘要

诸如DALL-E和Stable Diffusion等文本到图像生成模型彻底改变了广告、个性化媒体和设计原型等各类应用中的视觉内容创作。然而,设计有效的文本来引导这些模型仍然具有挑战性,通常需要大量的试错。现有的提示反转方法(如软提示和硬提示技术)由于可解释性有限且生成的提示不连贯,效果欠佳。为解决这些问题,我们提出了视觉引导解码(VGD),这是一种无需梯度的新方法,利用大语言模型(LLM)和基于CLIP的引导来生成连贯且语义对齐的提示。本质上,VGD利用LLM强大的文本生成能力来生成人类可读的提示。此外,通过采用CLIP分数确保与用户指定的视觉概念对齐,VGD在无需额外训练的情况下,提升了提示生成的可解释性、泛化能力和灵活性。实验表明,VGD在生成易于理解且上下文相关的提示方面优于现有提示反转技术,有助于实现与文本到图像模型更直观、可控的交互。


Achieving Scalable Robot Autonomy via neurosymbolic planning using lightweight local LLM

Abstract

arXiv:2505.08492v1 Announce Type: new Abstract: PDDL-based symbolic task planning remains pivotal for robot autonomy yet struggles with dynamic human-robot collaboration due to scalability, re-planning demands, and delayed plan availability. Although a few neurosymbolic frameworks have previously leveraged LLMs such as GPT-3 to address these challenges, reliance on closed-source, remote models with limited context introduced critical constraints: third-party dependency, inconsistent response times, restricted plan length and complexity, and multi-domain scalability issues. We present Gideon, a novel framework that enables the transition to modern, smaller, local LLMs with extended context length. Gideon integrates a novel problem generator to systematically generate large-scale datasets of realistic domain-problem-plan tuples for any domain, and adapts neurosymbolic planning for local LLMs, enabling on-device execution and extended context for multi-domain support. Preliminary experiments in single-domain scenarios performed on Qwen-2.5 1.5B and trained on 8k-32k samples, demonstrate a valid plan percentage of 66.1% (32k model) and show that the figure can be further scaled through additional data. Multi-domain tests on 16k samples yield an even higher 70.6% planning validity rate, proving extensibility across domains and signaling that data variety can have a positive effect on learning efficiency. Although long-horizon planning and reduced model size make Gideon training much less efficient than baseline models based on larger LLMs, the results are still significant considering that the trained model is about 120x smaller than baseline and that significant advantages can be achieved in inference efficiency, scalability, and multi-domain adaptability, all critical factors in human-robot collaboration. Training inefficiency can be mitigated by Gideon's streamlined data generation pipeline.

摘要

基于PDDL的符号化任务规划在机器人自主性中仍具关键地位,但由于可扩展性、重规划需求和延迟的计划可用性等问题,其在动态人机协作中的应用面临挑战。尽管已有少数神经符号框架利用GPT-3等大型语言模型应对这些挑战,但依赖上下文有限的闭源远程模型带来了关键限制:第三方依赖性、响应时间不稳定、规划长度与复杂度受限以及多领域可扩展性问题。本文提出Gideon新型框架,通过扩展上下文长度实现向现代小型本地语言模型的过渡。该框架集成创新的问题生成器,可系统化生成任意领域的大规模现实领域-问题-规划三元组数据集,并针对本地语言模型调整神经符号规划方法,支持设备端执行和跨领域扩展上下文。在Qwen-2.5 1.5B模型上进行的单领域初步实验(基于8k-32k样本训练)显示有效规划率达66.1%(32k模型),表明该指标可通过增加数据进一步提升。基于16k样本的多领域测试获得70.6%的更高规划有效率,证实了跨领域扩展能力,表明数据多样性对学习效率具有积极影响。尽管长周期规划和小型化模型使Gideon训练效率显著低于基于大型语言模型的基线模型,但考虑到训练模型体积缩小约120倍,且在推理效率、可扩展性和多领域适应性等人机协作关键因素上取得显著优势,研究成果仍具重要意义。通过Gideon的流线型数据生成流程,可有效缓解训练效率不足的问题。


TRAIL: Trace Reasoning and Agentic Issue Localization

Abstract

arXiv:2505.08638v1 Announce Type: new Abstract: The increasing adoption of agentic workflows across diverse domains brings a critical need to scalably and systematically evaluate the complex traces these systems generate. Current evaluation methods depend on manual, domain-specific human analysis of lengthy workflow traces - an approach that does not scale with the growing complexity and volume of agentic outputs. Error analysis in these settings is further complicated by the interplay of external tool outputs and language model reasoning, making it more challenging than traditional software debugging. In this work, we (1) articulate the need for robust and dynamic evaluation methods for agentic workflow traces, (2) introduce a formal taxonomy of error types encountered in agentic systems, and (3) present a set of 148 large human-annotated traces (TRAIL) constructed using this taxonomy and grounded in established agentic benchmarks. To ensure ecological validity, we curate traces from both single and multi-agent systems, focusing on real-world applications such as software engineering and open-world information retrieval. Our evaluations reveal that modern long context LLMs perform poorly at trace debugging, with the best Gemini-2.5-pro model scoring a mere 11% on TRAIL. Our dataset and code are made publicly available to support and accelerate future research in scalable evaluation for agentic workflows.

摘要

随着智能体工作流在各领域的广泛应用,对这些系统生成的复杂轨迹进行可扩展、系统化评估的需求日益凸显。当前评估方法依赖于人工对冗长工作流轨迹进行领域特异性分析,这种模式难以应对智能体输出日益增长的复杂性和规模。外部工具输出与语言模型推理的交互作用使得错误分析比传统软件调试更为困难。本研究(1)阐明了智能体工作流轨迹评估对鲁棒动态方法的需求,(2)提出了智能体系统错误类型的规范化分类体系,(3)基于该分类体系构建了包含148条人工标注轨迹的TRAIL数据集,其植根于成熟的智能体基准测试。为确保生态效度,我们采集了单智能体与多智能体系统的真实应用轨迹,重点关注软件工程和开放世界信息检索等实际场景。评估表明,现代长上下文LLM在轨迹调试中表现欠佳,性能最佳的Gemini-2.5-pro模型在TRAIL上仅获得11%的得分。我们公开了数据集与代码,以支持和加速智能体工作流可扩展评估的未来研究。


LLM-based Prompt Ensemble for Reliable Medical Entity Recognition from EHRs

Abstract

arXiv:2505.08704v1 Announce Type: new Abstract: Electronic Health Records (EHRs) are digital records of patient information, often containing unstructured clinical text. Named Entity Recognition (NER) is essential in EHRs for extracting key medical entities like problems, tests, and treatments to support downstream clinical applications. This paper explores prompt-based medical entity recognition using large language models (LLMs), specifically GPT-4o and DeepSeek-R1, guided by various prompt engineering techniques, including zero-shot, few-shot, and an ensemble approach. Among all strategies, GPT-4o with prompt ensemble achieved the highest classification performance with an F1-score of 0.95 and recall of 0.98, outperforming DeepSeek-R1 on the task. The ensemble method improved reliability by aggregating outputs through embedding-based similarity and majority voting.

摘要

电子健康记录(EHR)是患者信息的数字化记录,通常包含非结构化的临床文本。命名实体识别(NER)在EHR中对于提取关键医疗实体(如问题、检查和治疗)以支持下游临床应用至关重要。本文探讨了基于提示的大型语言模型(LLM)(特别是GPT-4o和DeepSeek-R1)在医疗实体识别中的应用,并采用了多种提示工程技术,包括零样本、少样本和集成方法。在所有策略中,采用提示集成的GPT-4o取得了最高的分类性能,F1分数达到0.95,召回率为0.98,在该任务中表现优于DeepSeek-R1。集成方法通过基于嵌入的相似性和多数投票聚合输出,提高了可靠性。
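
The abstract reports an F1-score and a recall but no precision. If both figures are computed over the same predictions with the same averaging (an assumption the abstract does not confirm), the precision follows from the definition F1 = 2PR / (P + R), which rearranges to P = F1 * R / (2R - F1). A small sketch of that arithmetic:

```python
def precision_from_f1_and_recall(f1, recall):
    """Invert F1 = 2PR / (P + R) to solve for the precision P."""
    denominator = 2 * recall - f1
    if denominator <= 0:
        raise ValueError("no valid precision exists for this F1/recall pair")
    return f1 * recall / denominator


# Figures quoted in the abstract for GPT-4o with the prompt ensemble.
implied_precision = precision_from_f1_and_recall(0.95, 0.98)
```

Under that assumption the implied precision is roughly 0.92, meaning the ensemble would miss very few entities while accepting somewhat more false positives.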


Polysemy of Synthetic Neurons Towards a New Type of Explanatory Categorical Vector Spaces

Abstract

arXiv:2505.07831v1 Announce Type: cross Abstract: The polysemantic nature of synthetic neurons in artificial intelligence language models is currently understood as the result of a necessary superposition of distributed features within the latent space. We propose an alternative approach, geometrically defining a neuron in layer n as a categorical vector space with a non-orthogonal basis, composed of categorical sub-dimensions extracted from preceding neurons in layer n-1. This categorical vector space is structured by the activation space of each neuron and enables, via an intra-neuronal attention process, the identification and utilization of a critical categorical zone for the efficiency of the language model - more homogeneous and located at the intersection of these different categorical sub-dimensions.

摘要

人工智慧语言模型中合成神经元的多义性,目前被理解为潜在空间内分布式特征必要叠加的结果。我们提出一种几何学替代方案,将第n层神经元定义为具有非正交基的范畴向量空间,其由从第n-1层前驱神经元提取的范畴子维度构成。该范畴向量空间通过每个神经元的激活空间进行结构化,并借助神经元内注意力机制,能够识别并利用语言模型效率的关键范畴区域——该区域更具同质性且位于不同范畴子维度的交汇处。


DeepMath-Creative: A Benchmark for Evaluating Mathematical Creativity of Large Language Models

Abstract

arXiv:2505.08744v1 Announce Type: new Abstract: To advance the mathematical proficiency of large language models (LLMs), the DeepMath team has launched an open-source initiative aimed at developing an open mathematical LLM and systematically evaluating its mathematical creativity. This paper represents the initial contribution of this initiative. While recent developments in mathematical LLMs have predominantly emphasized reasoning skills, as evidenced by benchmarks on elementary to undergraduate-level mathematical tasks, the creative capabilities of these models have received comparatively little attention, and evaluation datasets remain scarce. To address this gap, we propose evaluation criteria for mathematical creativity and introduce DeepMath-Creative, a novel, high-quality benchmark comprising constructive problems across algebra, geometry, analysis, and other domains. We conduct a systematic evaluation of mainstream LLMs' creative problem-solving abilities using this dataset. Experimental results show that even under lenient scoring criteria -- emphasizing core solution components and disregarding minor inaccuracies, such as small logical gaps, incomplete justifications, or redundant explanations -- the best-performing model, O3 Mini, achieves merely 70% accuracy, primarily on basic undergraduate-level constructive tasks. Performance declines sharply on more complex problems, with models failing to provide substantive strategies for open problems. These findings suggest that, although current LLMs display a degree of constructive proficiency on familiar and lower-difficulty problems, such performance is likely attributable to the recombination of memorized patterns rather than authentic creative insight or novel synthesis.

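Summary

To advance the mathematical ability of large language models (LLMs), the DeepMath team has launched an open-source initiative to develop an open mathematical LLM and systematically evaluate its mathematical creativity. This paper is the initiative's first contribution. Recent work on mathematical LLMs has focused mainly on reasoning skills, as reflected in benchmarks covering elementary to undergraduate-level tasks, while the creative capabilities of these models have received little attention and evaluation datasets remain scarce. To fill this gap, we propose evaluation criteria for mathematical creativity and build DeepMath-Creative, a new high-quality benchmark of constructive problems spanning algebra, geometry, analysis, and other domains. Using this dataset, we systematically evaluate the creative problem-solving abilities of mainstream LLMs. Even under lenient scoring (emphasizing core solution components and ignoring minor flaws such as small logical gaps, incomplete justifications, or redundant explanations), the best-performing model, O3 Mini, reaches only 70% accuracy, primarily on basic undergraduate-level constructive tasks. Performance drops sharply on more complex problems, and models fail to offer substantive strategies for open problems. These findings suggest that, although current LLMs show some constructive proficiency on familiar, lower-difficulty problems, this performance likely stems from recombining memorized patterns rather than from genuine creative insight or novel synthesis.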


A Tale of Two Identities: An Ethical Audit of Human and AI-Crafted Personas

Abstract

arXiv:2505.07850v1 Announce Type: cross Abstract: As LLMs (large language models) are increasingly used to generate synthetic personas particularly in data-limited domains such as health, privacy, and HCI, it becomes necessary to understand how these narratives represent identity, especially that of minority communities. In this paper, we audit synthetic personas generated by 3 LLMs (GPT4o, Gemini 1.5 Pro, Deepseek 2.5) through the lens of representational harm, focusing specifically on racial identity. Using a mixed methods approach combining close reading, lexical analysis, and a parameterized creativity framework, we compare 1512 LLM generated personas to human-authored responses. Our findings reveal that LLMs disproportionately foreground racial markers, overproduce culturally coded language, and construct personas that are syntactically elaborate yet narratively reductive. These patterns result in a range of sociotechnical harms, including stereotyping, exoticism, erasure, and benevolent bias, that are often obfuscated by superficially positive narrations. We formalize this phenomenon as algorithmic othering, where minoritized identities are rendered hypervisible but less authentic. Based on these findings, we offer design recommendations for narrative-aware evaluation metrics and community-centered validation protocols for synthetic identity generation.

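Summary

As large language models (LLMs) are increasingly used to generate synthetic personas, particularly in data-limited domains such as health, privacy, and human-computer interaction, it becomes essential to understand how these narratives represent identity, especially that of minority communities. Through the lens of representational harm, we audit synthetic personas generated by three LLMs (GPT4o, Gemini 1.5 Pro, Deepseek 2.5), focusing on racial identity. Using a mixed-methods approach that combines close reading, lexical analysis, and a parameterized creativity framework, we compare 1512 LLM-generated personas with human-authored responses. We find that LLMs disproportionately foreground racial markers, overproduce culturally coded language, and construct personas that are syntactically elaborate yet narratively reductive. These patterns lead to a range of sociotechnical harms, including stereotyping, exoticism, erasure, and benevolent bias, which are often masked by superficially positive narration. We formalize this phenomenon as "algorithmic othering", in which minoritized identities are rendered hypervisible but less authentic. Based on these findings, we offer design recommendations for synthetic identity generation, including narrative-aware evaluation metrics and community-centered validation protocols.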


Boosting Performance on ARC is a Matter of Perspective

Abstract

arXiv:2505.07859v1 Announce Type: cross Abstract: The Abstraction and Reasoning Corpus (ARC-AGI) poses a significant challenge for large language models (LLMs), exposing limitations in their abstract reasoning abilities. In this work, we leverage task-specific data augmentations throughout the training, generation, and scoring phases, and employ a depth-first search algorithm to generate diverse, high-probability candidate solutions. Furthermore, we utilize the LLM not only as a generator but also as a scorer, using its output probabilities to select the most promising solutions. Our method achieves a score of 71.6% (286.5/400 solved tasks) on the public ARC-AGI evaluation set, demonstrating state-of-the-art performance among publicly available approaches. While concurrent closed-source work has reported higher scores, our method distinguishes itself through its transparency, reproducibility, and remarkably low inference cost, averaging only around 2ct per task on readily available hardware (we assume a price of 36ct/hour for a Nvidia 4090 GPU).

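Summary

The Abstraction and Reasoning Corpus (ARC-AGI) poses a major challenge for large language models (LLMs) and exposes the limits of their abstract reasoning abilities. In this work, we apply task-specific data augmentations during training, generation, and scoring, and use a depth-first search algorithm to generate diverse, high-probability candidate solutions. We also use the LLM not only as a generator but also as a scorer, selecting the most promising solutions by their output probabilities. Our method scores 71.6% (286.5/400 solved tasks) on the public ARC-AGI evaluation set, the state of the art among publicly available approaches. Although concurrent closed-source work has reported higher scores, our method stands out for its transparency, reproducibility, and very low inference cost: on readily available hardware it averages only about 2 ct per task, assuming a price of 36 ct/hour for an Nvidia 4090 GPU.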
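The cost claim can be sanity-checked with simple arithmetic. The sketch below uses only the figures stated in the abstract (about 2 ct per task at an assumed 36 ct per GPU-hour, and 286.5 of 400 tasks solved); the function and variable names are illustrative and do not come from the paper.

```python
# Back-of-the-envelope check of the figures quoted in the abstract.
GPU_PRICE_CT_PER_HOUR = 36.0  # assumed Nvidia 4090 price, as stated in the abstract
COST_CT_PER_TASK = 2.0        # approximate average cost per task, as stated


def implied_gpu_seconds(cost_ct: float, price_ct_per_hour: float) -> float:
    """GPU time in seconds that a given spend buys at a given hourly price."""
    return cost_ct / price_ct_per_hour * 3600.0


def score_percent(solved: float, total: int) -> float:
    """Solved tasks expressed as a percentage of the evaluation set."""
    return 100.0 * solved / total


seconds_per_task = implied_gpu_seconds(COST_CT_PER_TASK, GPU_PRICE_CT_PER_HOUR)
public_eval_score = score_percent(286.5, 400)

print(f"Implied GPU time per task: about {seconds_per_task:.0f} s")
print(f"Score on the public evaluation set: {public_eval_score:.3f}%")
```

Under these stated prices, roughly 2 ct per task corresponds to a little over three minutes of GPU time per task, and 286.5/400 works out to 71.625%, which the abstract reports as 71.6%.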


Scalable LLM Math Reasoning Acceleration with Low-rank Distillation

Abstract

arXiv:2505.07861v1 Announce Type: cross Abstract: Due to long generations, large language model (LLM) math reasoning demands significant computational resources and time. While many existing efficient inference methods have been developed with excellent performance preservation on language tasks, they often severely degrade math performance. In this paper, we propose Caprese, a low-cost distillation method to recover lost capabilities from deploying efficient inference methods, focused primarily in feedforward blocks. With original weights unperturbed, roughly 1% of additional parameters, and only 20K synthetic training samples, we are able to recover much if not all of the math capabilities lost from efficient inference for thinking LLMs and without harm to language tasks for instruct LLMs. Moreover, Caprese slashes the number of active parameters (~2B cut for Gemma 2 9B and Llama 3.1 8B) and integrates cleanly into existing model layers to reduce latency (>11% reduction to generate 2048 tokens with Qwen 2.5 14B) while encouraging response brevity.

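Summary

Because generations are long, large language model (LLM) math reasoning consumes substantial computational resources and time. Although many existing efficient inference methods preserve performance well on language tasks, they often severely degrade math performance. We propose Caprese, a low-cost distillation method, focused primarily on feedforward blocks, for recovering the capabilities lost when efficient inference methods are deployed. With the original weights left unperturbed, roughly 1% additional parameters, and only 20K synthetic training samples, Caprese recovers much if not all of the math capability lost to efficient inference in thinking LLMs, without harming language tasks for instruct LLMs. In addition, Caprese sharply reduces the number of active parameters (about 2B fewer for Gemma 2 9B and Llama 3.1 8B) and integrates cleanly into existing model layers to reduce latency (more than 11% lower when generating 2048 tokens with Qwen 2.5 14B), while encouraging shorter responses.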


Joint Detection of Fraud and Concept Drift in Online Conversations with LLM-Assisted Judgment

Abstract

arXiv:2505.07852v1 Announce Type: cross Abstract: Detecting fake interactions in digital communication platforms remains a challenging and insufficiently addressed problem. These interactions may appear as harmless spam or escalate into sophisticated scam attempts, making it difficult to flag malicious intent early. Traditional detection methods often rely on static anomaly detection techniques that fail to adapt to dynamic conversational shifts. One key limitation is the misinterpretation of benign topic transitions referred to as concept drift as fraudulent behavior, leading to either false alarms or missed threats. We propose a two stage detection framework that first identifies suspicious conversations using a tailored ensemble classification model. To improve the reliability of detection, we incorporate a concept drift analysis step using a One Class Drift Detector (OCDD) to isolate conversational shifts within flagged dialogues. When drift is detected, a large language model (LLM) assesses whether the shift indicates fraudulent manipulation or a legitimate topic change. In cases where no drift is found, the behavior is inferred to be spam like. We validate our framework using a dataset of social engineering chat scenarios and demonstrate its practical advantages in improving both accuracy and interpretability for real time fraud detection. To contextualize the trade offs, we compare our modular approach against a Dual LLM baseline that performs detection and judgment using different language models.

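Summary

Detecting fake interactions on digital communication platforms remains a challenging and insufficiently addressed problem. Such interactions may look like harmless spam or escalate into sophisticated scam attempts, which makes malicious intent hard to flag early. Traditional detection methods usually rely on static anomaly detection techniques that cannot adapt to dynamic shifts in a conversation. A key limitation is that benign topic transitions, known as concept drift, are misread as fraudulent behavior, producing false alarms or missed threats. We propose a two-stage detection framework. The first stage identifies suspicious conversations with a tailored ensemble classification model. To improve reliability, a concept drift analysis step based on a One Class Drift Detector (OCDD) then isolates conversational shifts within the flagged dialogues. When drift is detected, a large language model (LLM) assesses whether the shift indicates fraudulent manipulation or a legitimate topic change; when no drift is found, the behavior is inferred to be spam-like. We validate the framework on a dataset of social engineering chat scenarios and show its practical advantages in improving both accuracy and interpretability for real-time fraud detection. To put the trade-offs in context, we compare this modular approach with a Dual LLM baseline that uses different language models for detection and judgment.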
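The decision flow described in the abstract can be sketched as a small routing function. This is only an illustration of the control flow: the three callables stand in for the ensemble classifier, the OCDD drift step, and the LLM judge, and the keyword-based stubs, label strings, and function names are hypothetical placeholders rather than anything from the paper.

```python
from typing import Callable, List

Conversation = List[str]


def route_conversation(
    messages: Conversation,
    is_suspicious: Callable[[Conversation], bool],     # stage 1: ensemble classifier
    has_drift: Callable[[Conversation], bool],         # stage 2: concept drift detector
    llm_judges_fraud: Callable[[Conversation], bool],  # LLM judgment on the drift
) -> str:
    """Return a label following the two-stage flow described in the abstract."""
    if not is_suspicious(messages):
        return "not flagged"
    if not has_drift(messages):
        # Flagged but no conversational shift: inferred to be spam-like.
        return "spam-like"
    # Drift found: the LLM decides between manipulation and a benign topic change.
    return "fraud" if llm_judges_fraud(messages) else "legitimate topic change"


# Toy stand-ins so the sketch runs; real components would be trained models.
def toy_suspicious(msgs: Conversation) -> bool:
    return any("gift card" in m.lower() for m in msgs)


def toy_drift(msgs: Conversation) -> bool:
    return len(msgs) > 1 and "weather" in msgs[0].lower()


def toy_llm_judge(msgs: Conversation) -> bool:
    return any("urgent" in m.lower() for m in msgs)


example_label = route_conversation(
    ["Nice weather today!", "Urgent: buy a gift card and send me the code."],
    toy_suspicious,
    toy_drift,
    toy_llm_judge,
)
print(example_label)
```

Keeping the three components behind plain callables mirrors the modular design the abstract contrasts with the Dual LLM baseline: each stage can be swapped without touching the routing logic.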


Efficient Fairness Testing in Large Language Models: Prioritizing Metamorphic Relations for Bias Detection

Abstract

arXiv:2505.07870v1 Announce Type: cross Abstract: Large Language Models (LLMs) are increasingly deployed in various applications, raising critical concerns about fairness and potential biases in their outputs. This paper explores the prioritization of metamorphic relations (MRs) in metamorphic testing as a strategy to efficiently detect fairness issues within LLMs. Given the exponential growth of possible test cases, exhaustive testing is impractical; therefore, prioritizing MRs based on their effectiveness in detecting fairness violations is crucial. We apply a sentence diversity-based approach to compute and rank MRs to optimize fault detection. Experimental results demonstrate that our proposed prioritization approach improves fault detection rates by 22% compared to random prioritization and 12% compared to distance-based prioritization, while reducing the time to the first failure by 15% and 8%, respectively. Furthermore, our approach performs within 5% of fault-based prioritization in effectiveness, while significantly reducing the computational cost associated with fault labeling. These results validate the effectiveness of diversity-based MR prioritization in enhancing fairness testing for LLMs.

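Summary

Large language models (LLMs) are deployed in an ever wider range of applications, raising serious concerns about fairness and potential bias in their outputs. This paper explores prioritizing metamorphic relations (MRs) in metamorphic testing as a strategy for detecting fairness issues in LLMs efficiently. Because the number of possible test cases grows exponentially, exhaustive testing is impractical, so it is crucial to prioritize MRs by how effectively they detect fairness violations. We use a sentence diversity-based approach to compute and rank MRs in order to optimize fault detection. Experimental results show that the proposed prioritization improves the fault detection rate by 22% over random prioritization and by 12% over distance-based prioritization, while reducing the time to first failure by 15% and 8%, respectively. Our approach also performs within 5% of fault-based prioritization in effectiveness while significantly reducing the computational cost of fault labeling. These results validate diversity-based MR prioritization as an effective way to strengthen fairness testing for LLMs.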
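To make the idea of ranking metamorphic relations by sentence diversity concrete, here is a minimal sketch. The abstract does not say which diversity measure the authors use, so the token-level Jaccard distance below is an assumption chosen for simplicity, and the MR names and example sentences are hypothetical.

```python
from typing import Dict, List, Tuple


def jaccard_distance(a: str, b: str) -> float:
    """1 - |A ∩ B| / |A ∪ B| over lowercased whitespace tokens (assumed measure)."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    union = tokens_a | tokens_b
    if not union:
        return 0.0
    return 1.0 - len(tokens_a & tokens_b) / len(union)


def rank_mrs(mr_pairs: Dict[str, List[Tuple[str, str]]]) -> List[str]:
    """Order MRs by the mean diversity between source and follow-up inputs.

    Each MR must come with at least one (source, follow-up) sentence pair.
    """
    scores = {
        name: sum(jaccard_distance(src, follow) for src, follow in pairs) / len(pairs)
        for name, pairs in mr_pairs.items()
    }
    return sorted(scores, key=lambda name: scores[name], reverse=True)


# Hypothetical MRs, each with one source / follow-up sentence pair.
examples = {
    "swap_name": [("Alex is a good engineer", "Maria is a good engineer")],
    "add_attribute": [
        ("Alex is a good engineer",
         "Alex from a small rural town is a good engineer"),
    ],
}
ranking = rank_mrs(examples)
print(ranking)
```

In this toy setup the MR that changes more of the sentence is tested first; whether that ordering finds fairness violations sooner is the empirical question the paper evaluates.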


Scaling Laws for Speculative Decoding

Abstract

arXiv:2505.07858v1 Announce Type: cross Abstract: The escalating demand for efficient decoding in large language models (LLMs) is particularly critical for reasoning-intensive architectures like OpenAI-o3 and DeepSeek-R1, which depend on extended chain-of-thought reasoning. This study investigates speculative decoding techniques through dense LLM architectures to establish foundational insights for accelerating reasoning tasks. While speculative decoding methods leveraging parallel draft-verification cycles have emerged as promising acceleration techniques, the scaling laws governing decoding efficiency remain under-explored compared to conventional backbone LLMs developed through Pretraining->SFT->RLHF training paradigms. In this work, we discover Log-linear Scaling Laws (Theorem 1.1, 1.2 and 1.3) governing draft model acceptance rate (or decoding speed) across three dimensions: pretraining token volume, draft model capacity, and decoding batch size. Building on these laws, we achieve Scylla, which coordinates multi-dimensional scaling for popular LLMs (Llama2/3, Qwen2.5). Empirical validation shows Scylla achieves 1.5-2.2 higher acceptance rate than EAGLE2 and 0.3 higher than EAGLE3 at temperature T = 0, with peak performance gains on summarization and QA tasks (Figure 2). Industrial inference engine deployments demonstrate 2X decoding throughput improvements over EAGLE2 (Table 5), validating the transformative potential of systematic scaling for efficient LLM inference. Code will be released later.

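Summary

Demand for efficient decoding in large language models (LLMs) keeps growing, and it is especially critical for reasoning-intensive architectures such as OpenAI-o3 and DeepSeek-R1, which depend on extended chain-of-thought reasoning. This study investigates speculative decoding on dense LLM architectures to establish foundational insights for accelerating reasoning tasks. Although speculative decoding methods that use parallel draft-verification cycles have emerged as promising acceleration techniques, the scaling laws governing their decoding efficiency remain under-explored compared with those of conventional backbone LLMs developed through the Pretraining->SFT->RLHF training paradigm. We discover log-linear scaling laws (Theorems 1.1, 1.2, and 1.3) that govern draft model acceptance rate (or decoding speed) along three dimensions: pretraining token volume, draft model capacity, and decoding batch size. Building on these laws, we develop Scylla, which coordinates multi-dimensional scaling for popular LLMs (Llama2/3, Qwen2.5). Empirical validation shows that, at temperature T = 0, Scylla achieves an acceptance rate 1.5-2.2 higher than EAGLE2 and 0.3 higher than EAGLE3, with peak gains on summarization and QA tasks (Figure 2). Deployments in industrial inference engines show a 2X improvement in decoding throughput over EAGLE2 (Table 5), validating the potential of systematic scaling for efficient LLM inference. Code will be released later.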
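As an illustration of what fitting a log-linear law involves, the sketch below runs ordinary least squares on y = a + b·ln(x). The abstract does not give the exact form of Theorems 1.1 to 1.3, so this functional form is an assumption, and the data points are synthetic values generated inside the script, not measurements from the paper.

```python
import math
from typing import List, Tuple


def fit_log_linear(xs: List[float], ys: List[float]) -> Tuple[float, float]:
    """Least-squares fit of y = a + b * ln(x); returns (a, b)."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need at least two (x, y) pairs of equal length")
    logs = [math.log(x) for x in xs]
    n = len(logs)
    mean_log = sum(logs) / n
    mean_y = sum(ys) / n
    var_log = sum((l - mean_log) ** 2 for l in logs)
    if var_log == 0.0:
        raise ValueError("x values must not all be identical")
    b = sum((l - mean_log) * (y - mean_y) for l, y in zip(logs, ys)) / var_log
    a = mean_y - b * mean_log
    return a, b


# Synthetic example: acceptance rate versus a scale variable such as
# pretraining tokens. The numbers are made up purely for illustration.
scale = [1e8, 1e9, 1e10]
acceptance = [0.2 + 0.03 * math.log(x) for x in scale]

a, b = fit_log_linear(scale, acceptance)
print(f"a = {a:.3f}, b = {b:.3f}")
```

Because the synthetic data follow the assumed law exactly, the fit recovers the coefficients used to generate them; with real measurements the residuals would indicate how well a log-linear form describes each dimension.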


CrashSage: A Large Language Model-Centered Framework for Contextual and Interpretable Traffic Crash Analysis

Abstract

arXiv:2505.07853v1 Announce Type: cross Abstract: Road crashes claim over 1.3 million lives annually worldwide and incur global economic losses exceeding $1.8 trillion. Such profound societal and financial impacts underscore the urgent need for road safety research that uncovers crash mechanisms and delivers actionable insights. Conventional statistical models and tree ensemble approaches typically rely on structured crash data, overlooking contextual nuances and struggling to capture complex relationships and underlying semantics. Moreover, these approaches tend to incur significant information loss, particularly in narrative elements related to multi-vehicle interactions, crash progression, and rare event characteristics. This study presents CrashSage, a novel Large Language Model (LLM)-centered framework designed to advance crash analysis and modeling through four key innovations. First, we introduce a tabular-to-text transformation strategy paired with relational data integration schema, enabling the conversion of raw, heterogeneous crash data into enriched, structured textual narratives that retain essential structural and relational context. Second, we apply context-aware data augmentation using a base LLM model to improve narrative coherence while preserving factual integrity. Third, we fine-tune the LLaMA3-8B model for crash severity inference, demonstrating superior performance over baseline approaches, including zero-shot, zero-shot with chain-of-thought prompting, and few-shot learning, with multiple models (GPT-4o, GPT-4o-mini, LLaMA3-70B). Finally, we employ a gradient-based explainability technique to elucidate model decisions at both the individual crash level and across broader risk factor dimensions. This interpretability mechanism enhances transparency and enables targeted road safety interventions by providing deeper insights into the most influential factors.

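Summary

Road crashes claim more than 1.3 million lives worldwide each year and cause global economic losses exceeding $1.8 trillion. These societal and financial impacts underscore the urgent need for road safety research that uncovers crash mechanisms and delivers actionable insights. Conventional statistical models and tree ensemble approaches typically rely on structured crash data, overlooking contextual nuances and struggling to capture complex relationships and underlying semantics. They also tend to lose significant information, particularly in narrative elements related to multi-vehicle interactions, crash progression, and rare event characteristics. This study presents CrashSage, a novel LLM-centered framework that advances crash analysis and modeling through four key innovations. First, a tabular-to-text transformation strategy, paired with a relational data integration schema, converts raw, heterogeneous crash data into enriched textual narratives that retain essential structural and relational context. Second, context-aware data augmentation with a base LLM improves narrative coherence while preserving factual integrity. Third, the LLaMA3-8B model is fine-tuned for crash severity inference and outperforms baseline approaches, including zero-shot, zero-shot with chain-of-thought prompting, and few-shot learning, across multiple models (GPT-4o, GPT-4o-mini, LLaMA3-70B). Finally, a gradient-based explainability technique elucidates model decisions both at the level of individual crashes and across broader risk factor dimensions. This interpretability mechanism improves transparency and supports targeted road safety interventions by providing deeper insight into the most influential factors.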
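The first step, turning tabular crash records into text, can be pictured with a small template-based sketch. The abstract does not describe the actual schema or templates, so every field name (`road_type`, `weather`, `lighting`, `type`, `action`) and the sentence wording below are hypothetical.

```python
from typing import Dict, List


def crash_to_narrative(crash: Dict[str, str], vehicles: List[Dict[str, str]]) -> str:
    """Join a crash-level row with its vehicle-level rows into one narrative."""
    sentences = [
        f"The crash occurred on a {crash['road_type']} road in "
        f"{crash['weather']} weather under {crash['lighting']} lighting."
    ]
    sentences.append(f"{len(vehicles)} vehicle(s) were involved.")
    # One sentence per related vehicle row keeps the relational context.
    for index, vehicle in enumerate(vehicles, start=1):
        sentences.append(
            f"Vehicle {index} was a {vehicle['type']} that was {vehicle['action']}."
        )
    return " ".join(sentences)


# Hypothetical records: one crash row and two linked vehicle rows.
crash_row = {"road_type": "two-lane rural", "weather": "rainy", "lighting": "dark"}
vehicle_rows = [
    {"type": "passenger car", "action": "turning left"},
    {"type": "pickup truck", "action": "going straight"},
]

narrative = crash_to_narrative(crash_row, vehicle_rows)
print(narrative)
```

A fixed template like this preserves the facts in each row; the paper's second step, LLM-based augmentation, is what would then smooth such text into a more coherent narrative.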


PLHF: Prompt Optimization with Few-Shot Human Feedback

Abstract

arXiv:2505.07886v1 Announce Type: cross Abstract: Automatic prompt optimization frameworks are developed to obtain suitable prompts for large language models (LLMs) with respect to desired output quality metrics. Although existing approaches can handle conventional tasks such as fixed-solution question answering, defining the metric becomes complicated when the output quality cannot be easily assessed by comparisons with standard golden samples. Consequently, optimizing the prompts effectively and efficiently without a clear metric becomes a critical challenge. To address the issue, we present PLHF (which stands for "P"rompt "L"earning with "H"uman "F"eedback), a few-shot prompt optimization framework inspired by the well-known RLHF technique. Different from naive strategies, PLHF employs a specific evaluator module acting as the metric to estimate the output quality. PLHF requires only a single round of human feedback to complete the entire prompt optimization process. Empirical results on both public and industrial datasets show that PLHF outperforms prior output grading strategies for LLM prompt optimizations.

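Summary

Automatic prompt optimization frameworks are developed to obtain suitable prompts for large language models (LLMs) with respect to desired output quality metrics. Although existing approaches can handle conventional tasks such as fixed-solution question answering, defining the metric becomes complicated when output quality cannot easily be assessed by comparison with standard golden samples. Optimizing prompts effectively and efficiently without a clear metric therefore becomes a critical challenge. To address it, we present PLHF (Prompt Learning with Human Feedback), a few-shot prompt optimization framework inspired by the well-known RLHF technique. Unlike naive strategies, PLHF uses a dedicated evaluator module as the metric that estimates output quality. PLHF needs only a single round of human feedback to complete the entire prompt optimization process. Empirical results on both public and industrial datasets show that PLHF outperforms prior output grading strategies for LLM prompt optimization.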


Recovering Event Probabilities from Large Language Model Embeddings via Axiomatic Constraints

Abstract

arXiv:2505.07883v1 Announce Type: cross Abstract: Rational decision-making under uncertainty requires coherent degrees of belief in events. However, event probabilities generated by Large Language Models (LLMs) have been shown to exhibit incoherence, violating the axioms of probability theory. This raises the question of whether coherent event probabilities can be recovered from the embeddings used by the models. If so, those derived probabilities could be used as more accurate estimates in events involving uncertainty. To explore this question, we propose enforcing axiomatic constraints, such as the additive rule of probability theory, in the latent space learned by an extended variational autoencoder (VAE) applied to LLM embeddings. This approach enables event probabilities to naturally emerge in the latent space as the VAE learns to both reconstruct the original embeddings and predict the embeddings of semantically related events. We evaluate our method on complementary events (i.e., event A and its complement, event not-A), where the true probabilities of the two events must sum to 1. Experiment results on open-weight language models demonstrate that probabilities recovered from embeddings exhibit greater coherence than those directly reported by the corresponding models and align closely with the true probabilities.

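Summary

Rational decision-making under uncertainty requires coherent degrees of belief in events. However, event probabilities generated by large language models (LLMs) have been shown to be incoherent, violating the axioms of probability theory. This raises the question of whether coherent event probabilities can be recovered from the embeddings the models use. If so, the derived probabilities could serve as more accurate estimates for events involving uncertainty. To explore this question, we propose enforcing axiomatic constraints, such as the additive rule of probability theory, in the latent space learned by an extended variational autoencoder (VAE) applied to LLM embeddings. With this approach, event probabilities emerge naturally in the latent space as the VAE learns both to reconstruct the original embeddings and to predict the embeddings of semantically related events. We evaluate the method on complementary events (event A and its complement, event not-A), whose true probabilities must sum to 1. Experiments on open-weight language models show that probabilities recovered from embeddings are more coherent than those reported directly by the corresponding models and align closely with the true probabilities.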
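The coherence requirement on complementary events is easy to state in code. The sketch below measures how far a pair of reported probabilities is from summing to 1; it illustrates the axiom being checked, not the paper's VAE method, and the function names and sample values are made up for the example.

```python
from typing import List, Tuple


def complement_incoherence(p_event: float, p_complement: float) -> float:
    """Absolute violation of P(A) + P(not A) = 1."""
    return abs(p_event + p_complement - 1.0)


def mean_incoherence(pairs: List[Tuple[float, float]]) -> float:
    """Average violation over a set of (P(A), P(not A)) pairs."""
    if not pairs:
        raise ValueError("need at least one pair")
    return sum(complement_incoherence(p, q) for p, q in pairs) / len(pairs)


# Hypothetical probabilities, as if elicited separately for A and for not-A.
reported = [(0.70, 0.50), (0.20, 0.60), (0.90, 0.10)]

violations = [complement_incoherence(p, q) for p, q in reported]
average_violation = mean_incoherence(reported)
print(violations, average_violation)
```

A perfectly coherent set of estimates has an average violation of zero, so this kind of score gives a simple way to compare directly reported probabilities with recovered ones.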


Enhanced Urdu Intent Detection with Large Language Models and Prototype-Informed Predictive Pipelines

Abstract

arXiv:2505.07857v1 Announce Type: cross Abstract: Multifarious intent detection predictors have been developed for different languages, including English, Chinese, and French; however, the field remains underdeveloped for Urdu, the 10th most spoken language. For well-known languages, intent detection predictors use few-shot learning to predict unseen classes after the model has been trained on seen classes. The Urdu language, however, lacks few-shot intent detection predictors, and traditional predictors focus on predicting the same classes that the models have seen in the training set. To empower Urdu-specific intent detection, this paper introduces a unique contrastive learning approach that leverages unlabeled Urdu data to re-train pre-trained language models. This re-training strengthens the LLMs' representation learning for the downstream intent detection task. Finally, the paper combines the potential of pre-trained LLMs and the prototype-informed attention mechanism to create a comprehensive end-to-end LLMPIA intent detection pipeline. Under the paradigm of the proposed predictive pipeline, it explores the potential of 6 distinct language models and 13 distinct similarity computation methods. The proposed framework is evaluated on 2 public benchmark datasets, namely ATIS, encompassing 5836 samples, and Web Queries, with 8519 samples. On the ATIS dataset under the 4-way 1-shot and 4-way 5-shot experimental settings, LLMPIA achieved F1-scores of 83.28% and 98.25%, and on the Web Queries dataset it produced F1-scores of 76.23% and 84.42%, respectively. In an additional case study on the Web Queries dataset in which the training and test sets share the same classes, LLMPIA outperformed the state-of-the-art predictor by 53.55% F1-score.

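Summary

A wide variety of intent detection predictors have been developed for languages such as English, Chinese, and French, yet the field remains underdeveloped for Urdu, the 10th most spoken language. For well-known languages, intent detection predictors use few-shot learning to predict unseen classes after training on seen classes. Urdu, however, lacks few-shot intent detection predictors, and traditional predictors only predict classes that the model has already seen in the training set. To strengthen Urdu-specific intent detection, this paper introduces a contrastive learning approach that uses unlabeled Urdu data to re-train pre-trained language models. This re-training improves the LLMs' representation learning for the downstream intent detection task. The approach then combines pre-trained LLMs with a prototype-informed attention mechanism to build a comprehensive end-to-end intent detection pipeline, LLMPIA. Within this pipeline, the paper explores 6 distinct language models and 13 distinct similarity computation methods. The framework is evaluated on 2 public benchmark datasets: ATIS (5836 samples) and Web Queries (8519 samples). In the 4-way 1-shot and 4-way 5-shot settings, LLMPIA achieves F1-scores of 83.28% and 98.25% on ATIS and 76.23% and 84.42% on Web Queries, respectively. In an additional case study on Web Queries in which the training and test sets share the same classes, LLMPIA outperforms the state-of-the-art predictor by 53.55% F1-score.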
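For readers unfamiliar with the N-way K-shot setting, the following sketch shows plain prototype-based classification: each class prototype is the mean of its K support embeddings, and a query takes the label of the most similar prototype. This is the generic technique, not the paper's prototype-informed attention mechanism; cosine similarity is just one possible choice among the similarity measures the paper compares, and the intent names and 2-dimensional vectors are toy assumptions.

```python
import math
from typing import Dict, List, Sequence

Vector = Sequence[float]


def mean_vector(vectors: List[Vector]) -> List[float]:
    """Element-wise mean of equally sized vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def build_prototypes(support: Dict[str, List[Vector]]) -> Dict[str, List[float]]:
    """One prototype per class: the mean of its K support embeddings."""
    return {label: mean_vector(vectors) for label, vectors in support.items()}


def predict(query: Vector, prototypes: Dict[str, List[float]]) -> str:
    """Label of the prototype most similar to the query embedding."""
    return max(prototypes, key=lambda label: cosine(query, prototypes[label]))


# Toy 2-way 2-shot episode with made-up 2-dimensional "embeddings".
support_set = {
    "book_flight": [[1.0, 0.0], [0.8, 0.2]],
    "check_fare": [[0.0, 1.0], [0.2, 0.8]],
}
prototypes = build_prototypes(support_set)
predicted_intent = predict([0.7, 0.3], prototypes)
print(predicted_intent)
```

In a 4-way 5-shot episode like those reported in the abstract, `support_set` would hold four classes with five embeddings each, produced by the language model under evaluation.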


TrumorGPT: Graph-Based Retrieval-Augmented Large Language Model for Fact-Checking

Abstract

arXiv:2505.07891v1 Announce Type: cross Abstract: In the age of social media, the rapid spread of misinformation and rumors has led to the emergence of infodemics, where false information poses a significant threat to society. To combat this issue, we introduce TrumorGPT, a novel generative artificial intelligence solution designed for fact-checking in the health domain. TrumorGPT aims to distinguish "trumors", which are health-related rumors that turn out to be true, providing a crucial tool in differentiating between mere speculation and verified facts. This framework leverages a large language model (LLM) with few-shot learning for semantic health knowledge graph construction and semantic reasoning. TrumorGPT incorporates graph-based retrieval-augmented generation (GraphRAG) to address the hallucination issue common in LLMs and the limitations of static training data. GraphRAG involves accessing and utilizing information from regularly updated semantic health knowledge graphs that consist of the latest medical news and health information, ensuring that fact-checking by TrumorGPT is based on the most recent data. Evaluating with extensive healthcare datasets, TrumorGPT demonstrates superior performance in fact-checking for public health claims. Its ability to effectively conduct fact-checking across various platforms marks a critical step forward in the fight against health-related misinformation, enhancing trust and accuracy in the digital information age.

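Summary

In the age of social media, the rapid spread of misinformation and rumors has given rise to infodemics, in which false information poses a significant threat to society. To address this problem, we introduce TrumorGPT, a novel generative AI solution designed for fact-checking in the health domain. TrumorGPT aims to identify "trumors", health-related rumors that turn out to be true, providing a key tool for separating mere speculation from verified facts. The framework uses a large language model (LLM) with few-shot learning for semantic health knowledge graph construction and semantic reasoning. It incorporates graph-based retrieval-augmented generation (GraphRAG) to address the hallucination problem common in LLMs and the limitations of static training data. GraphRAG accesses and uses information from regularly updated semantic health knowledge graphs built from the latest medical news and health information, so that fact-checking is based on the most recent data. Evaluated on extensive healthcare datasets, TrumorGPT shows superior performance in fact-checking public health claims. Its ability to fact-check effectively across platforms marks an important step in the fight against health-related misinformation, improving trust and accuracy in the digital information age.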


CellVerse: Do Large Language Models Really Understand Cell Biology?

Abstract

arXiv:2505.07865v1 Announce Type: cross Abstract: Recent studies have demonstrated the feasibility of modeling single-cell data as natural languages and the potential of leveraging powerful large language models (LLMs) for understanding cell biology. However, a comprehensive evaluation of LLMs' performance on language-driven single-cell analysis tasks still remains unexplored. Motivated by this challenge, we introduce CellVerse, a unified language-centric question-answering benchmark that integrates four types of single-cell multi-omics data and encompasses three hierarchical levels of single-cell analysis tasks: cell type annotation (cell-level), drug response prediction (drug-level), and perturbation analysis (gene-level). Going beyond this, we systematically evaluate the performance across 14 open-source and closed-source LLMs ranging from 160M to 671B on CellVerse. Remarkably, the experimental results reveal: (1) Existing specialist models (C2S-Pythia) fail to make reasonable decisions across all sub-tasks within CellVerse, while generalist models such as Qwen, Llama, GPT, and DeepSeek family models exhibit preliminary understanding capabilities within the realm of cell biology. (2) The performance of current LLMs falls short of expectations and has substantial room for improvement. Notably, in the widely studied drug response prediction task, none of the evaluated LLMs demonstrate significant performance improvement over random guessing. CellVerse offers the first large-scale empirical demonstration that significant challenges still remain in applying LLMs to cell biology. By introducing CellVerse, we lay the foundation for advancing cell biology through natural languages and hope this paradigm could facilitate next-generation single-cell analysis.

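Summary

Recent studies have shown that single-cell data can feasibly be modeled as natural language and that powerful large language models (LLMs) have potential for understanding cell biology. However, a comprehensive evaluation of LLM performance on language-driven single-cell analysis tasks is still missing. Motivated by this challenge, we introduce CellVerse, a unified language-centric question-answering benchmark that integrates four types of single-cell multi-omics data and covers three hierarchical levels of single-cell analysis tasks: cell type annotation (cell-level), drug response prediction (drug-level), and perturbation analysis (gene-level). We then systematically evaluate 14 open-source and closed-source LLMs, ranging from 160M to 671B parameters, on CellVerse. The experimental results reveal that: (1) existing specialist models (C2S-Pythia) fail to make reasonable decisions across all sub-tasks in CellVerse, whereas generalist models such as the Qwen, Llama, GPT, and DeepSeek families show a preliminary understanding of cell biology; and (2) the performance of current LLMs falls short of expectations and leaves substantial room for improvement. Notably, in the widely studied drug response prediction task, none of the evaluated LLMs performs significantly better than random guessing. CellVerse offers the first large-scale empirical demonstration that significant challenges remain in applying LLMs to cell biology. By introducing CellVerse, we lay a foundation for advancing cell biology through natural language and hope this paradigm will facilitate next-generation single-cell analysis.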


Bridging Large Language Models and Single-Cell Transcriptomics in Dissecting Selective Motor Neuron Vulnerability

Abstract

arXiv:2505.07896v1 Announce Type: cross Abstract: Understanding cell identity and function through single-cell level sequencing data remains a key challenge in computational biology. We present a novel framework that leverages gene-specific textual annotations from the NCBI Gene database to generate biologically contextualized cell embeddings. For each cell in a single-cell RNA sequencing (scRNA-seq) dataset, we rank genes by expression level, retrieve their NCBI Gene descriptions, and transform these descriptions into vector embedding representations using large language models (LLMs). The models used include OpenAI text-embedding-ada-002, text-embedding-3-small, and text-embedding-3-large (Jan 2024), as well as domain-specific models BioBERT and SciBERT. Embeddings are computed via an expression-weighted average across the top N most highly expressed genes in each cell, providing a compact, semantically rich representation. This multimodal strategy bridges structured biological data with state-of-the-art language modeling, enabling more interpretable downstream applications such as cell-type clustering, cell vulnerability dissection, and trajectory inference.

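Summary

Understanding cell identity and function from single-cell sequencing data remains a key challenge in computational biology. We present a novel framework that uses gene-specific textual annotations from the NCBI Gene database to generate biologically contextualized cell embeddings. For each cell in a single-cell RNA sequencing (scRNA-seq) dataset, we rank genes by expression level, retrieve their NCBI Gene descriptions, and transform these descriptions into vector embeddings using large language models (LLMs). The models used include OpenAI text-embedding-ada-002, text-embedding-3-small, and text-embedding-3-large (Jan 2024), as well as the domain-specific models BioBERT and SciBERT. Each cell embedding is computed as an expression-weighted average over the top N most highly expressed genes in the cell, yielding a compact, semantically rich representation. This multimodal strategy bridges structured biological data and state-of-the-art language modeling, enabling more interpretable downstream applications such as cell-type clustering, cell vulnerability dissection, and trajectory inference.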
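The expression-weighted averaging step can be sketched in a few lines. The gene names, expression values, and 2-dimensional vectors below are toy placeholders; in the described framework the vectors would be LLM embeddings of NCBI Gene descriptions. Skipping genes that have no description embedding before ranking is an assumption of this sketch, not something the abstract specifies.

```python
from typing import Dict, List


def cell_embedding(
    expression: Dict[str, float],
    gene_vectors: Dict[str, List[float]],
    top_n: int,
) -> List[float]:
    """Expression-weighted average of the embeddings of the top-N expressed genes."""
    # Assumption: genes without an embedding are ignored before ranking.
    candidates = [gene for gene in expression if gene in gene_vectors]
    ranked = sorted(candidates, key=lambda gene: expression[gene], reverse=True)[:top_n]
    total = sum(expression[gene] for gene in ranked)
    if not ranked or total <= 0.0:
        raise ValueError("no expressed genes with embeddings to average")
    dimension = len(gene_vectors[ranked[0]])
    result = [0.0] * dimension
    for gene in ranked:
        weight = expression[gene] / total  # weights sum to 1 over the top-N genes
        for i, value in enumerate(gene_vectors[gene]):
            result[i] += weight * value
    return result


# Toy cell with three hypothetical genes and made-up 2-d description embeddings.
toy_expression = {"GENE_A": 3.0, "GENE_B": 1.0, "GENE_C": 0.5}
toy_vectors = {"GENE_A": [1.0, 0.0], "GENE_B": [0.0, 1.0], "GENE_C": [1.0, 1.0]}

embedding = cell_embedding(toy_expression, toy_vectors, top_n=2)
print(embedding)
```

With `top_n=2` only GENE_A and GENE_B contribute, with weights 0.75 and 0.25, so highly expressed genes dominate the cell's representation.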


Evaluating Financial Sentiment Analysis with Annotators Instruction Assisted Prompting: Enhancing Contextual Interpretation and Stock Prediction Accuracy

Abstract

arXiv:2505.07871v1 Announce Type: cross Abstract: Financial sentiment analysis (FSA) presents unique challenges to LLMs that surpass those in typical sentiment analysis due to the nuanced language used in financial contexts. The prowess of these models is often undermined by the inherent subjectivity of sentiment classifications in existing benchmark datasets like Financial Phrasebank. These datasets typically feature undefined sentiment classes that reflect the highly individualized perspectives of annotators, leading to significant variability in annotations. This variability results in an unfair expectation for LLMs during benchmarking, where they are tasked to conjecture the subjective viewpoints of human annotators without sufficient context. In this paper, we introduce the Annotators' Instruction Assisted Prompt (AIAP), a novel evaluation prompt designed to redefine the task definition of FSA for LLMs. By integrating detailed task instructions originally intended for human annotators into the LLMs' prompt framework, AIAP aims to standardize the understanding of sentiment across both human and machine interpretations, providing a fair and context-rich foundation for sentiment analysis. We utilize a new dataset, WSBS, derived from the WallStreetBets subreddit to demonstrate how AIAP significantly enhances LLM performance by aligning machine operations with the refined task definitions. Experimental results demonstrate that AIAP enhances LLM performance significantly, with improvements up to 9.08. This context-aware approach not only yields incremental gains in performance but also introduces an innovative sentiment-indexing method utilizing model confidence scores. This method enhances stock price prediction models and extracts more value from the financial sentiment analysis, underscoring the significance of WSB as a critical source of financial text. Our research offers insights into improving FSA through better evaluation methods.

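Summary

Financial sentiment analysis (FSA) poses challenges for LLMs that go beyond those of typical sentiment analysis, because of the nuanced language used in financial contexts. Model performance is often undermined by the inherent subjectivity of sentiment classifications in existing benchmark datasets such as Financial Phrasebank. These datasets typically contain undefined sentiment classes that reflect the highly individual perspectives of annotators, leading to significant variability in annotations. This variability places an unfair expectation on LLMs during benchmarking, since they must guess the subjective viewpoints of human annotators without sufficient context. In this paper, we introduce the Annotators' Instruction Assisted Prompt (AIAP), a novel evaluation prompt designed to redefine the FSA task for LLMs. By integrating the detailed task instructions originally written for human annotators into the LLM prompt, AIAP aims to standardize the understanding of sentiment across human and machine interpretations, providing a fair and context-rich foundation for sentiment analysis. Using a new dataset, WSBS, derived from the WallStreetBets subreddit, we show that AIAP significantly improves LLM performance by aligning machine operation with the refined task definitions. Experimental results show improvements of up to 9.08. This context-aware approach not only yields incremental performance gains but also introduces a sentiment-indexing method based on model confidence scores. The method enhances stock price prediction models and extracts more value from financial sentiment analysis, underscoring the significance of WSB as a critical source of financial text. Our research offers insights into improving FSA through better evaluation methods.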


Implementing Long Text Style Transfer with LLMs through Dual-Layered Sentence and Paragraph Structure Extraction and Mapping

Abstract

arXiv:2505.07888v1 Announce Type: cross Abstract: This paper addresses the challenge in long-text style transfer using zero-shot learning of large language models (LLMs), proposing a hierarchical framework that combines sentence-level stylistic adaptation with paragraph-level structural coherence. We argue that in the process of effective paragraph-style transfer, to preserve the consistency of original syntactic and semantic information, it is essential to perform style transfer not only at the sentence level but also to incorporate paragraph-level semantic considerations, while ensuring structural coherence across inter-sentential relationships. Our proposed framework, ZeroStylus, operates through two systematic phases: hierarchical template acquisition from reference texts and template-guided generation with multi-granular matching. The framework dynamically constructs sentence and paragraph template repositories, enabling context-aware transformations while preserving inter-sentence logical relationships. Experimental evaluations demonstrate significant improvements over baseline methods, with structured rewriting achieving 6.90 average score compared to 6.70 for direct prompting approaches in tri-axial metrics assessing style consistency, content preservation, and expression quality. Ablation studies validate the necessity of both template hierarchies during style transfer, showing higher content preservation win rate against sentence-only approaches through paragraph-level structural encoding, as well as direct prompting method through sentence-level pattern extraction and matching. The results establish new capabilities for coherent long-text style transfer without requiring parallel corpora or LLM fine-tuning.

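Summary

This paper addresses the challenge of long-text style transfer with zero-shot large language models (LLMs), proposing a hierarchical framework that combines sentence-level stylistic adaptation with paragraph-level structural coherence. We argue that, to preserve the consistency of the original syntactic and semantic information during effective paragraph-style transfer, style transfer must be performed not only at the sentence level but also with paragraph-level semantic considerations, while ensuring structural coherence across inter-sentential relationships. The proposed framework, ZeroStylus, works in two systematic phases: hierarchical template acquisition from reference texts, and template-guided generation with multi-granular matching. The framework dynamically builds sentence and paragraph template repositories, enabling context-aware transformations while preserving the logical relationships between sentences. Experimental evaluations show improvements over baseline methods: structured rewriting achieves an average score of 6.90, compared with 6.70 for direct prompting, on tri-axial metrics that assess style consistency, content preservation, and expression quality. Ablation studies validate the need for both template hierarchies, showing a higher content preservation win rate against sentence-only approaches through paragraph-level structural encoding, and against direct prompting through sentence-level pattern extraction and matching. The results establish new capabilities for coherent long-text style transfer without parallel corpora or LLM fine-tuning.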


Efficient Telecom Specific LLM: TSLAM-Mini with QLoRA and Digital Twin Data

Abstract

arXiv:2505.07877v1 Announce Type: cross Abstract: General-purpose large language models (LLMs), despite their broad capabilities accrued from open-world data, frequently exhibit suboptimal performance when confronted with the nuanced and specialized demands inherent in real-time telecommunications applications. This investigation addresses this critical limitation through the meticulous fine-tuning of TSLAM-Mini developed by NetoAI, a compact (3.8-billion parameter) causal language model architecturally derived from Phi-4 Mini Instruct 4B. The fine-tuning regimen leverages a bespoke dataset comprising 100,000 samples, strategically engineered to address 20 pivotal telecommunications use-cases, encompassing domains such as Network Fundamentals, IP Routing, MPLS, Network Security, Automation, OSS/BSS, RAN, Mobile Core, Satellite Communications, and Ethical AI. This dataset was curated utilizing NetoAI's DigiTwin platform, enriched with granular insights from venerated network Subject Matter Experts (SMEs) and authoritative RFC documents, thereby capturing high-fidelity representations of real-world network dynamics through simulations inspired by digital twin paradigms. Employing Quantized Low-Rank Adaptation (QLoRA), a state-of-the-art Parameter Efficient Fine-Tuning (PEFT) technique, we achieved substantial training efficiency and enabled prospective deployment on resource-constrained hardware. A novel evaluation framework, predicated on a high-capacity LLM (Qwen3-235B-A22B) functioning as an automated adjudicator, was instituted to rigorously assess instruction-following fidelity and response quality across the specified telecom use-cases. Empirical results unequivocally demonstrate TSLAM-Mini's superior aptitude in telecom-centric applications, underscoring the profound efficacy of domain-specific datasets and PEFT methodologies for advancing intelligent network management.

摘要

通用大语言模型(LLMs)虽然通过开放世界数据获得了广泛能力,但在面对实时电信应用中固有的细微专业需求时,往往表现欠佳。本研究通过精心微调NetoAI开发的TSLAM-Mini(一个基于Phi-4 Mini Instruct 4B架构、具有38亿参数的紧凑型因果语言模型),解决了这一关键局限。微调过程采用包含10万样本的定制数据集,该数据集针对20个关键电信用例进行战略设计,涵盖网络基础、IP路由、MPLS、网络安全、自动化、OSS/BSS、无线接入网、移动核心网、卫星通信及伦理人工智能等领域。数据集通过NetoAI的DigiTwin平台构建,并融合了网络领域专家(SMEs)的深度洞察和权威RFC文档,借助数字孪生范式启发的模拟技术,实现了对真实网络动态的高保真表征。采用量化低秩自适应(QLoRA)这一先进参数高效微调(PEFT)技术,我们显著提升了训练效率,并实现了在资源受限硬件上的前瞻性部署。研究还建立了一个基于高性能LLM(Qwen3-235B-A22B)的新型评估框架作为自动裁决器,用于严格评估模型在指定电信用例中的指令遵循精度和响应质量。实证结果明确表明TSLAM-Mini在电信领域应用中的卓越性能,印证了领域专用数据集与PEFT方法对推进智能网络管理的显著成效。
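The parameter savings behind the QLoRA approach mentioned above can be illustrated with a minimal sketch of the low-rank adapter idea: the base weight `W` stays frozen and only two small matrices `A` and `B` are trained, giving `y = W x + (alpha / r) * B (A x)`. The sketch deliberately omits the 4-bit quantization of the base weights that distinguishes QLoRA from plain LoRA. The function names, matrix sizes, and `alpha` value are illustrative assumptions, not details taken from the TSLAM-Mini paper.

```python
from typing import List

Matrix = List[List[float]]
Vector = List[float]


def matvec(m: Matrix, v: Vector) -> Vector:
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(mij * vj for mij, vj in zip(row, v)) for row in m]


def lora_forward(w: Matrix, a: Matrix, b: Matrix, x: Vector, alpha: float) -> Vector:
    """Frozen base projection plus a scaled low-rank correction."""
    r = len(a)                       # adapter rank = number of rows of A
    base = matvec(w, x)              # frozen path: W x
    delta = matvec(b, matvec(a, x))  # trainable path: B (A x)
    scale = alpha / r
    return [y + scale * d for y, d in zip(base, delta)]


def trainable_params(d_out: int, d_in: int, r: int) -> int:
    """Parameters in A (r x d_in) and B (d_out x r)."""
    return r * (d_in + d_out)


# Tiny hypothetical example: 2x3 base weight, rank-1 adapter.
w = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
a = [[1.0, 1.0, 1.0]]
b = [[0.5], [-0.5]]
y = lora_forward(w, a, b, x=[1.0, 2.0, 3.0], alpha=2.0)

# Hypothetical layer size, chosen only to show the ratio.
full = 4096 * 4096
adapter = trainable_params(4096, 4096, r=16)
```

For the hypothetical 4096 x 4096 layer, the rank-16 adapter trains 131,072 parameters instead of roughly 16.8 million, which is the kind of reduction that makes fine-tuning on modest hardware plausible.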


DeltaEdit: Enhancing Sequential Editing in Large Language Models by Controlling Superimposed Noise

Abstract

arXiv:2505.07899v1 Announce Type: cross Abstract: Sequential knowledge editing techniques aim to continuously update the knowledge in large language models at a low cost, preventing the models from generating outdated or incorrect information. However, existing sequential editing methods suffer from a significant decline in editing success rates after long-term editing. Through theoretical analysis and experiments, we identify that as the number of edits increases, the model's output increasingly deviates from the desired target, leading to a drop in editing success rates. We refer to this issue as the accumulation of superimposed noise problem. To address this, we identify the factors contributing to this deviation and propose DeltaEdit, a novel method that optimizes update parameters through a dynamic orthogonal constraints strategy, effectively reducing interference between edits to mitigate deviation. Experimental results demonstrate that DeltaEdit significantly outperforms existing methods in edit success rates and the retention of generalization capabilities, ensuring stable and reliable model performance even under extensive sequential editing.

摘要

序列化知识编辑技术旨在以低成本持续更新大语言模型中的知识,防止模型生成过时或错误信息。然而现有序列编辑方法在长期编辑后会出现编辑成功率显著下降的问题。通过理论分析与实验验证,我们发现随着编辑次数增加,模型输出与期望目标的偏差逐渐累积,导致编辑成功率下降。这一问题被定义为叠加噪声累积效应。为此,我们解析了导致偏差的关键因素,并提出DeltaEdit方法——通过动态正交约束策略优化更新参数,有效减少编辑间相互干扰以抑制偏差。实验结果表明,DeltaEdit在编辑成功率和泛化能力保持方面显著优于现有方法,即使在大规模序列编辑下也能确保模型性能稳定可靠。
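One generic way to picture an orthogonality constraint between edits is to remove from each new update any component that lies along earlier update directions, so that a later edit does not disturb what earlier ones changed. The sketch below does this with plain Gram-Schmidt projection on small vectors. It is an illustration of the general idea only; the abstract does not specify how DeltaEdit's dynamic orthogonal constraints are actually computed, and `project_out` is a hypothetical helper.

```python
from typing import List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def project_out(update: Vector, previous: List[Vector], eps: float = 1e-12) -> Vector:
    """Return `update` with its components along earlier updates removed."""
    basis: List[Vector] = []
    # Build an orthonormal basis for the span of the earlier updates.
    for p in previous:
        v = list(p)
        for b in basis:
            c = dot(v, b)
            v = [vi - c * bi for vi, bi in zip(v, b)]
        norm = dot(v, v) ** 0.5
        if norm > eps:  # skip directions already covered by the basis
            basis.append([vi / norm for vi in v])
    # Subtract the projection of the new update onto that span.
    result = list(update)
    for b in basis:
        c = dot(result, b)
        result = [ri - c * bi for ri, bi in zip(result, b)]
    return result


previous_updates = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0]]
new_update = [0.5, 2.0, 3.0]
constrained = project_out(new_update, previous_updates)
```

In this toy case the earlier edits span the first two coordinates, so only the third coordinate of the new update survives.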


LongCodeBench: Evaluating Coding LLMs at 1M Context Windows

Abstract

arXiv:2505.07897v1 Announce Type: cross Abstract: Context lengths for models have grown rapidly, from thousands to millions of tokens in just a few years. The extreme context sizes of modern long-context models have made it difficult to construct realistic long-context benchmarks -- not only due to the cost of collecting million-context tasks but also in identifying realistic scenarios that require significant contexts. We identify code comprehension and repair as a natural testbed and challenge task for long-context models and introduce LongCodeBench (LCB), a benchmark to test LLM coding abilities in long-context scenarios. Our benchmark tests both the comprehension and repair capabilities of LCLMs in realistic and important settings by drawing from real-world GitHub issues and constructing QA (LongCodeQA) and bug fixing (LongSWE-Bench) tasks. We carefully stratify the complexity of our benchmark, enabling us to evaluate models across different scales -- ranging from Qwen2.5 14B Instruct to Google's flagship Gemini model. We find that long-context remains a weakness for all models, with performance drops such as from 29% to 3% for Claude 3.5 Sonnet, or from 70.2% to 40% for Qwen2.5.

摘要

模型上下文长度在短短几年内从数千个标记快速增长至数百万。现代长上下文模型的极端上下文规模使得构建真实的长上下文基准测试变得困难——不仅因为收集百万级上下文任务的成本高昂,还在于识别需要大量上下文的真实场景。我们提出代码理解与修复作为长上下文模型的天然测试平台和挑战任务,并推出LongCodeBench(LCB)这一基准测试,用于评估大语言模型在长上下文场景中的编码能力。该基准通过提取真实GitHub问题构建问答(LongCodeQA)和缺陷修复(LongSWE-Bench)任务,测试长上下文语言模型在真实重要场景下的理解与修复能力。我们精心分层设计基准复杂度,从而能够评估从Qwen2.5 14B Instruct到谷歌旗舰Gemini模型等不同规模的模型。研究发现长上下文仍是所有模型的薄弱环节,性能下降显著:如Claude 3.5 Sonnet从29%降至3%,Qwen2.5从70.2%跌至40%。


SEM: Reinforcement Learning for Search-Efficient Large Language Models

Abstract

arXiv:2505.07903v1 Announce Type: cross Abstract: Recent advancements in Large Language Models (LLMs) have demonstrated their capabilities not only in reasoning but also in invoking external tools, particularly search engines. However, teaching models to discern when to invoke search and when to rely on their internal knowledge remains a significant challenge. Existing reinforcement learning approaches often lead to redundant search behaviors, resulting in inefficiencies and over-cost. In this paper, we propose SEM, a novel post-training reinforcement learning framework that explicitly trains LLMs to optimize search usage. By constructing a balanced dataset combining MuSiQue and MMLU, we create scenarios where the model must learn to distinguish between questions it can answer directly and those requiring external retrieval. We design a structured reasoning template and employ Group Relative Policy Optimization (GRPO) to post-train the model's search behaviors. Our reward function encourages accurate answering without unnecessary search while promoting effective retrieval when needed. Experimental results demonstrate that our method significantly reduces redundant search operations while maintaining or improving answer accuracy across multiple challenging benchmarks. This framework advances the model's reasoning efficiency and extends its capability to judiciously leverage external knowledge.

摘要

大语言模型(LLMs)的最新进展不仅展现了其推理能力,还证明了其调用外部工具(尤其是搜索引擎)的潜力。然而,如何使模型学会判断何时需调用搜索、何时可依赖内部知识仍是一项重大挑战。现有强化学习方法常导致冗余搜索行为,造成效率低下与成本过高。本文提出SEM,一种新颖的后训练强化学习框架,通过显式训练优化LLMs的搜索使用机制。通过整合MuSiQue和MMLU构建平衡数据集,我们创设了模型必须学会区分可直接回答与需外部检索问题的场景。我们设计了结构化推理模板,并采用组相对策略优化(GRPO)对模型的搜索行为进行后训练。所设计的奖励函数鼓励在不进行不必要搜索的情况下准确作答,并在需要时促进有效检索。实验结果表明,该方法在多个高难度基准测试中显著减少冗余搜索操作,同时保持或提升回答准确率。该框架不仅提升了模型推理效率,更扩展了其审慎利用外部知识的能力。
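The reward shaping described above can be sketched as a small function that pays full reward for a correct answer, pays less when a correct answer relied on a search that was not needed, and pays nothing for a wrong answer. The specific values (1.0, 0.5, 0.0) and the `needed_search` label are hypothetical; the abstract does not state the actual reward values or how the need for retrieval is determined. The second function shows the group-relative normalization that gives GRPO its name, in its commonly described form (reward minus group mean, divided by group standard deviation).

```python
from statistics import mean, pstdev
from typing import List


def search_reward(correct: bool, used_search: bool, needed_search: bool) -> float:
    """Hypothetical reward: accuracy first, with a discount for redundant search."""
    if not correct:
        return 0.0
    if used_search and not needed_search:
        return 0.5  # right answer, but the search call was unnecessary
    return 1.0


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize each reward against the other samples for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled responses to one question the model could answer directly.
rewards = [
    search_reward(correct=True, used_search=False, needed_search=False),
    search_reward(correct=True, used_search=True, needed_search=False),
    search_reward(correct=False, used_search=True, needed_search=False),
    search_reward(correct=True, used_search=False, needed_search=False),
]
advantages = group_relative_advantages(rewards)
```

Under this sketch, the response that answered correctly without searching receives the highest advantage within its group, so the policy update favors it over the correct-but-redundant response.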


FalseReject: A Resource for Improving Contextual Safety and Mitigating Over-Refusals in LLMs via Structured Reasoning

Abstract

arXiv:2505.08054v1 Announce Type: cross Abstract: Safety alignment approaches in large language models (LLMs) often lead to the over-refusal of benign queries, significantly diminishing their utility in sensitive scenarios. To address this challenge, we introduce FalseReject, a comprehensive resource containing 16k seemingly toxic queries accompanied by structured responses across 44 safety-related categories. We propose a graph-informed adversarial multi-agent interaction framework to generate diverse and complex prompts, while structuring responses with explicit reasoning to aid models in accurately distinguishing safe from unsafe contexts. FalseReject includes training datasets tailored for both standard instruction-tuned models and reasoning-oriented models, as well as a human-annotated benchmark test set. Our extensive benchmarking on 29 state-of-the-art (SOTA) LLMs reveals persistent over-refusal challenges. Empirical results demonstrate that supervised finetuning with FalseReject substantially reduces unnecessary refusals without compromising overall safety or general language capabilities.

摘要

大型语言模型(LLM)的安全对齐方法常导致对良性查询的过度拒绝,显著降低了其在敏感场景中的实用性。为解决这一问题,我们提出FalseReject——一个包含16k个表面敏感查询的综合资源库,这些查询覆盖44个安全相关类别并配有结构化响应。我们设计了一种基于图结构的对抗性多智能体交互框架,用于生成多样化的复杂提示,同时通过显式推理构建响应,以帮助模型准确区分安全与不安全语境。FalseReject包含为标准指令调优模型和推理导向模型定制的训练数据集,以及人工标注的基准测试集。通过对29个前沿LLM的广泛测试,我们发现了持续存在的过度拒绝问题。实证结果表明,使用FalseReject进行监督微调可在不损害整体安全性或通用语言能力的前提下,显著减少不必要的拒绝行为。


Beyond Input Activations: Identifying Influential Latents by Gradient Sparse Autoencoders

Abstract

arXiv:2505.08080v1 Announce Type: cross Abstract: Sparse Autoencoders (SAEs) have recently emerged as powerful tools for interpreting and steering the internal representations of large language models (LLMs). However, conventional approaches to analyzing SAEs typically rely solely on input-side activations, without considering the causal influence between each latent feature and the model's output. This work is built on two key hypotheses: (1) activated latents do not contribute equally to the construction of the model's output, and (2) only latents with high causal influence are effective for model steering. To validate these hypotheses, we propose Gradient Sparse Autoencoder (GradSAE), a simple yet effective method that identifies the most influential latents by incorporating output-side gradient information.

摘要

稀疏自编码器(SAE)近期已成为解释和调控大型语言模型(LLM)内部表征的重要工具。然而,传统SAE分析方法通常仅依赖输入侧激活值,而未考虑各潜在特征与模型输出间的因果影响。本研究基于两个关键假设:(1)被激活的潜在特征对模型输出的构建贡献不均等;(2)仅具有高因果影响的潜在特征能有效实现模型调控。为验证这些假设,我们提出梯度稀疏自编码器(GradSAE),该方法通过融合输出侧梯度信息来识别最具影响力的潜在特征,其实现简单却高效。
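The contrast between activation-only ranking and gradient-aware ranking can be shown with a minimal sketch. Here each latent is scored by the magnitude of its activation multiplied by the gradient of the output with respect to that latent. This product is a common attribution heuristic chosen for illustration; the abstract does not give GradSAE's exact scoring rule, and the numbers below are made up.

```python
from typing import List


def top_influential(activations: List[float], output_grads: List[float], k: int) -> List[int]:
    """Rank latents by |activation * output-side gradient| and return the top-k indices."""
    scores = [abs(a * g) for a, g in zip(activations, output_grads)]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]


def top_activated(activations: List[float], k: int) -> List[int]:
    """Baseline: rank latents by activation alone."""
    ranked = sorted(range(len(activations)), key=lambda i: activations[i], reverse=True)
    return ranked[:k]


# Hypothetical values for four SAE latents.
acts = [0.9, 0.0, 0.4, 0.7]
grads = [0.01, 2.0, 1.5, -0.8]

by_activation = top_activated(acts, k=2)
by_influence = top_influential(acts, grads, k=2)
```

Latent 0 has the largest activation but almost no effect on the output, so it drops out of the gradient-aware ranking, which matches the first hypothesis in the abstract that activated latents do not contribute equally.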


Large Language Models and Arabic Content: A Review

Abstract

arXiv:2505.08004v1 Announce Type: cross Abstract: Over the past three years, the rapid advancement of Large Language Models (LLMs) has had a profound impact on multiple areas of Artificial Intelligence (AI), particularly in Natural Language Processing (NLP) across diverse languages, including Arabic. Although Arabic is considered one of the most widely spoken languages across 27 countries in the Arabic world and used as a second language in some other non-Arabic countries as well, there is still a scarcity of Arabic resources, datasets, and tools. Arabic NLP tasks face various challenges due to the complexities of the Arabic language, including its rich morphology, intricate structure, and diverse writing standards, among other factors. Researchers have been actively addressing these challenges, demonstrating that pre-trained Large Language Models (LLMs) trained on multilingual corpora achieve significant success in various Arabic NLP tasks. This study provides an overview of using large language models (LLMs) for the Arabic language, highlighting early pre-trained Arabic Language models across various NLP applications and their ability to handle diverse Arabic content tasks and dialects. It also provides an overview of how techniques like finetuning and prompt engineering can enhance the performance of these models. Additionally, the study summarizes common Arabic benchmarks and datasets while presenting our observations on the persistent upward trend in the adoption of LLMs.

摘要

过去三年间,大型语言模型(LLMs)的快速发展对人工智能(AI)多个领域产生了深远影响,尤其在涵盖阿拉伯语等多样语言的自然语言处理(NLP)方面。尽管阿拉伯语作为阿拉伯世界27个国家的通用语言,并在部分非阿拉伯国家作为第二语言使用,其相关资源、数据集及工具仍显匮乏。由于阿拉伯语复杂的形态结构、繁复的语法体系及多样化的书写标准等语言特性,阿拉伯语NLP任务面临诸多挑战。研究者们正积极应对这些挑战,研究表明基于多语言语料库预训练的大型语言模型在各类阿拉伯语NLP任务中成效显著。本研究系统综述了大型语言模型在阿拉伯语领域的应用,重点探讨了早期预训练阿拉伯语模型在各类NLP应用中的表现及其处理多样化阿拉伯语内容任务和方言的能力,同时阐明了微调技术和提示工程如何提升模型性能。此外,研究还汇总了常见的阿拉伯语基准测试与数据集,并就LLMs应用持续增长的趋势提出了我们的观察结论。


Leveraging AI for Productive and Trustworthy HPC Software: Challenges and Research Directions

Abstract

arXiv:2505.08135v1 Announce Type: cross Abstract: We discuss the challenges and propose research directions for using AI to revolutionize the development of high-performance computing (HPC) software. AI technologies, in particular large language models, have transformed every aspect of software development. For its part, HPC software is recognized as a highly specialized scientific field of its own. We discuss the challenges associated with leveraging state-of-the-art AI technologies to develop such a unique and niche class of software and outline our research directions in the two US Department of Energy-funded projects for advancing HPC Software via AI: Ellora and Durban.

摘要

我们探讨了利用人工智能革新高性能计算(HPC)软件开发所面临的挑战,并提出了相关研究方向。人工智能技术,尤其是大语言模型,已经改变了软件开发的各个层面。而HPC软件本身作为一个高度专业化的科学领域也备受认可。本文讨论了如何运用最先进AI技术开发这类独特小众软件的挑战,并概述了美国能源部资助的两个通过AI推进HPC软件的研究项目(Ellora与Durban)中的研究方向。


Re^2: A Consistency-ensured Dataset for Full-stage Peer Review and Multi-turn Rebuttal Discussions

Abstract

arXiv:2505.07920v1 Announce Type: cross Abstract: Peer review is a critical component of scientific progress in fields like AI, but the rapid increase in submission volume has strained the reviewing system, which inevitably leads to reviewer shortages and declining review quality. Besides the growing research popularity, another key factor in this overload is the repeated resubmission of substandard manuscripts, largely due to the lack of effective tools for authors to self-evaluate their work before submission. Large Language Models (LLMs) show great promise in assisting both authors and reviewers, but their performance is fundamentally limited by the quality of the peer review data. However, existing peer review datasets face three major limitations: (1) limited data diversity, (2) inconsistent and low-quality data due to the use of revised rather than initial submissions, and (3) insufficient support for tasks involving rebuttal and reviewer-author interactions. To address these challenges, we introduce the largest consistency-ensured peer review and rebuttal dataset named Re^2, which comprises 19,926 initial submissions, 70,668 review comments, and 53,818 rebuttals from 24 conferences and 21 workshops on OpenReview. Moreover, the rebuttal and discussion stage is framed as a multi-turn conversation paradigm to support both traditional static review tasks and dynamic interactive LLM assistants, providing more practical guidance for authors to refine their manuscripts and helping alleviate the growing review burden. Our data and code are available at https://anonymous.4open.science/r/ReviewBench_anon/.

摘要

同行评审是人工智能等领域科学进步的关键环节,但投稿量的快速增长使得评审系统不堪重负,这不可避免地导致评审人员短缺和评审质量下降。除研究热度攀升外,造成这种过载的另一关键因素是不达标稿件的反复提交,其主要原因是作者缺乏有效的工具在投稿前进行自我评估。大语言模型在协助作者和审稿人方面展现出巨大潜力,但其性能从根本上受限于同行评审数据的质量。然而现有评审数据集存在三大局限:(1) 数据多样性不足;(2) 因使用修订版而非初始投稿导致数据不一致且质量低下;(3) 对涉及反驳和审稿人与作者交互任务的支持不足。为解决这些问题,我们推出了规模最大且确保一致性的评审与反驳数据集Re²,包含OpenReview平台上24个会议和21个研讨会的19,926篇初始投稿、70,668条评审意见和53,818份反驳内容。此外,我们将反驳与讨论阶段构建为多轮对话范式,既能支持传统的静态评审任务,也能服务于动态交互式大语言模型助手,为作者完善稿件提供更具实践性的指导,并帮助缓解日益增长的评审压力。数据与代码详见https://anonymous.4open.science/r/ReviewBench_anon/。


ALOHA: Empowering Multilingual Agent for University Orientation with Hierarchical Retrieval

Abstract

arXiv:2505.08130v1 Announce Type: cross Abstract: The rise of Large Language Models (LLMs) revolutionizes information retrieval, allowing users to obtain required answers through complex instructions within conversations. However, publicly available services remain inadequate in addressing the needs of faculty and students to search campus-specific information. It is primarily due to the LLM's lack of domain-specific knowledge and the limitation of search engines in supporting multilingual and timely scenarios. To tackle these challenges, we introduce ALOHA, a multilingual agent enhanced by hierarchical retrieval for university orientation. We also integrate external APIs into the front-end interface to provide interactive service. The human evaluation and case study show our proposed system has strong capabilities to yield correct, timely, and user-friendly responses to the queries in multiple languages, surpassing commercial chatbots and search engines. The system has been deployed and has provided service for more than 12,000 people.

摘要

大型语言模型(LLM)的兴起彻底改变了信息检索方式,使得用户能够通过对话中的复杂指令获取所需答案。然而,现有公开服务仍无法充分满足师生检索校园特定信息的需求,这主要源于LLM缺乏领域专业知识,以及搜索引擎在多语言支持和实时场景中的局限性。为解决这些问题,我们提出了ALOHA——一个通过分层检索增强的多语言校园导览智能体。该系统将外部API集成至前端界面以提供交互服务。人工评估与案例研究表明,我们所提出的系统能够以多语言生成准确、及时且用户友好的查询响应,其表现优于商业聊天机器人和搜索引擎。该系统已部署运行,并为超过12,000人提供了服务。


Are LLMs complicated ethical dilemma analyzers?

Abstract

arXiv:2505.08106v1 Announce Type: cross Abstract: One open question in the study of Large Language Models (LLMs) is whether they can emulate human ethical reasoning and act as believable proxies for human judgment. To investigate this, we introduce a benchmark dataset comprising 196 real-world ethical dilemmas and expert opinions, each segmented into five structured components: Introduction, Key Factors, Historical Theoretical Perspectives, Resolution Strategies, and Key Takeaways. We also collect non-expert human responses for comparison, limited to the Key Factors section due to their brevity. We evaluate multiple frontier LLMs (GPT-4o-mini, Claude-3.5-Sonnet, Deepseek-V3, Gemini-1.5-Flash) using a composite metric framework based on BLEU, Damerau-Levenshtein distance, TF-IDF cosine similarity, and Universal Sentence Encoder similarity. Metric weights are computed through an inversion-based ranking alignment and pairwise AHP analysis, enabling fine-grained comparison of model outputs to expert responses. Our results show that LLMs generally outperform non-expert humans in lexical and structural alignment, with GPT-4o-mini performing most consistently across all sections. However, all models struggle with historical grounding and proposing nuanced resolution strategies, which require contextual abstraction. Human responses, while less structured, occasionally achieve comparable semantic similarity, suggesting intuitive moral reasoning. These findings highlight both the strengths and current limitations of LLMs in ethical decision-making.

摘要

大型语言模型(LLMs)研究中的一个开放性问题在于其能否模拟人类伦理推理并作为人类判断的可信代理。为探究此问题,我们引入了一个包含196个现实世界伦理困境与专家意见的基准数据集,每个案例被划分为五个结构化部分:引言、关键因素、历史理论视角、解决策略和核心要点。由于非专家人类回答的简洁性,我们仅收集其在关键因素部分的响应用于对比。通过融合BLEU、Damerau-Levenshtein距离、TF-IDF余弦相似度和通用语句编码器相似度的复合指标框架,我们评估了多个前沿LLM(GPT-4o-mini、Claude-3.5-Sonnet、Deepseek-V3、Gemini-1.5-Flash)。指标权重通过基于逆序的排名对齐和成对层次分析法计算得出,实现了模型输出与专家响应的细粒度对比。研究结果表明,LLM在词汇和结构对齐方面普遍优于非专家人类,其中GPT-4o-mini在所有部分表现最为稳定。然而,所有模型在历史依据和需要情境抽象的精细化解决策略提议方面均存在困难。人类回答虽结构性较弱,但偶尔能达到相当的语义相似度,暗示其直觉式道德推理能力。这些发现既揭示了LLM在伦理决策中的优势,也凸显了其当前局限性。


Aitomia: Your Intelligent Assistant for AI-Driven Atomistic and Quantum Chemical Simulations

Abstract

arXiv:2505.08195v1 Announce Type: cross Abstract: We have developed Aitomia - a platform powered by AI to assist in performing AI-driven atomistic and quantum chemical (QC) simulations. This intelligent assistant platform is equipped with chatbots and AI agents to help experts and guide non-experts in setting up and running the atomistic simulations, monitoring their computation status, analyzing the simulation results, and summarizing them for the user in text and graphical forms. We achieve these goals by exploiting fine-tuned open-source large language models (LLMs), rule-based agents, and a retrieval-augmented generation (RAG) system. Aitomia leverages the versatility of our MLatom ecosystem for AI-enhanced computational chemistry. This intelligent assistant is going to be integrated into the Aitomistic Hub and XACS online computing services, with some functionality already publicly available as described at http://mlatom.com/aitomia. Aitomia is expected to lower the barrier to performing atomistic simulations, accelerating research and development in the relevant fields.

摘要

我们开发了Aitomia——一个由人工智能驱动的平台,用于辅助执行AI驱动的原子尺度与量子化学(QC)模拟。该智能辅助平台配备聊天机器人和AI代理,可协助专家并指导非专业人士完成原子模拟的建立与运行、计算状态监控、结果分析,并以文本和图形形式为用户总结成果。我们通过微调的开源大语言模型(LLMs)、基于规则的代理以及检索增强生成(RAG)系统实现这些功能。Aitomia充分利用了MLatom生态系统在AI增强计算化学中的多功能性。该智能助手将被集成至Aitomistic Hub和XACS在线计算服务中,部分功能已通过http://mlatom.com/aitomia公开提供。Aitomia有望降低原子尺度模拟的实施门槛,加速相关领域的研发进程。


Communication Styles and Reader Preferences of LLM and Human Experts in Explaining Health Information

Abstract

arXiv:2505.08143v1 Announce Type: cross Abstract: With the wide adoption of large language models (LLMs) in information assistance, it is essential to examine their alignment with human communication styles and values. We situate this study within the context of fact-checking health information, given the critical challenge of rectifying conceptions and building trust. Recent studies have explored the potential of LLM for health communication, but style differences between LLMs and human experts and associated reader perceptions remain under-explored. In this light, our study evaluates the communication styles of LLMs, focusing on how their explanations differ from those of humans in three core components of health communication: information, sender, and receiver. We compiled a dataset of 1498 health misinformation explanations from authoritative fact-checking organizations and generated LLM responses to inaccurate health information. Drawing from health communication theory, we evaluate communication styles across three key dimensions of information linguistic features, sender persuasive strategies, and receiver value alignments. We further assessed human perceptions through a blinded evaluation with 99 participants. Our findings reveal that LLM-generated articles showed significantly lower scores in persuasive strategies, certainty expressions, and alignment with social values and moral foundations. However, human evaluation demonstrated a strong preference for LLM content, with over 60% responses favoring LLM articles for clarity, completeness, and persuasiveness. Our results suggest that LLMs' structured approach to presenting information may be more effective at engaging readers despite scoring lower on traditional measures of quality in fact-checking and health communication.

摘要

随着大型语言模型(LLMs)在信息辅助领域的广泛应用,检验其与人类沟通方式及价值观的契合度变得至关重要。本研究以健康信息核查为背景,着眼于纠正错误认知与建立信任这一关键挑战。尽管近期研究探索了LLMs在健康传播中的潜力,但对其与人类专家在表达风格上的差异及读者感知的研究仍显不足。为此,我们评估了LLMs的沟通风格,重点分析其在健康传播三大核心要素(信息、发送者、接收者)中与人类解释的差异。我们收集了权威核查机构发布的1498条健康谣言解释数据,并生成LLMs对错误健康信息的回应。基于健康传播理论,我们从信息语言特征、发送者说服策略、接收者价值认同三个维度评估沟通风格差异,并通过99名参与者的盲测评估人类感知。研究发现:LLM生成文章在说服策略、确定性表达、社会价值及道德基础契合度上得分显著较低;但人类评估显示超过60%的参与者更青睐LLM内容,认为其清晰度、完整性和说服力更优。结果表明,尽管LLMs在传统质量指标上得分较低,但其结构化的信息呈现方式可能更有效吸引读者。


A Large-Scale Empirical Analysis of Custom GPTs' Vulnerabilities in the OpenAI Ecosystem

Abstract

arXiv:2505.08148v1 Announce Type: cross Abstract: Millions of users leverage generative pretrained transformer (GPT)-based language models developed by leading model providers for a wide range of tasks. To support enhanced user interaction and customization, many platforms, such as OpenAI, now enable developers to create and publish tailored model instances, known as custom GPTs, via dedicated repositories or application stores. These custom GPTs empower users to browse and interact with specialized applications designed to meet specific needs. However, as custom GPTs see growing adoption, concerns regarding their security vulnerabilities have intensified. Existing research on these vulnerabilities remains largely theoretical, often lacking empirical, large-scale, and statistically rigorous assessments of associated risks. In this study, we analyze 14,904 custom GPTs to assess their susceptibility to seven exploitable threats, such as roleplay-based attacks, system prompt leakage, phishing content generation, and malicious code synthesis, across various categories and popularity tiers within the OpenAI marketplace. We introduce a multi-metric ranking system to examine the relationship between a custom GPT's popularity and its associated security risks. Our findings reveal that over 95% of custom GPTs lack adequate security protections. The most prevalent vulnerabilities include roleplay-based vulnerabilities (96.51%), system prompt leakage (92.20%), and phishing (91.22%). Furthermore, we demonstrate that OpenAI's foundational models exhibit inherent security weaknesses, which are often inherited or amplified in custom GPTs. These results highlight the urgent need for enhanced security measures and stricter content moderation to ensure the safe deployment of GPT-based applications.

摘要

数百万用户依赖领先模型提供商开发的基于生成式预训练变换器(GPT)的语言模型完成各类任务。为增强用户交互与定制化体验,OpenAI等平台现允许开发者通过专用仓库或应用商店创建并发布定制化模型实例(称为自定义GPT)。这些自定义GPT使用户能够浏览并交互专为特定需求设计的应用程序。然而随着自定义GPT的广泛采用,其安全漏洞问题日益引发关注。现有研究多停留在理论层面,缺乏对相关风险的大规模实证分析与统计严谨的评估。

本研究分析了14,904个自定义GPT,评估其在OpenAI市场中不同类别及流行度层级下对七种可利用威胁的脆弱性,包括角色扮演攻击、系统提示泄露、钓鱼内容生成和恶意代码合成等。我们提出多指标排名系统,探究自定义GPT流行度与安全风险间的关联。

研究发现超过95%的自定义GPT缺乏足够安全防护。最普遍的漏洞包括角色扮演漏洞(96.51%)、系统提示泄露(92.20%)和钓鱼攻击(91.22%)。此外,我们证实OpenAI基础模型存在固有安全缺陷,这些缺陷常被自定义GPT继承或放大。研究结果凸显了加强安全措施与严格内容审核的紧迫性,以确保GPT应用的安全部署。


DSADF: Thinking Fast and Slow for Decision Making

Abstract

arXiv:2505.08189v1 Announce Type: cross Abstract: Although Reinforcement Learning (RL) agents are effective in well-defined environments, they often struggle to generalize their learned policies to dynamic settings due to their reliance on trial-and-error interactions. Recent work has explored applying Large Language Models (LLMs) or Vision Language Models (VLMs) to boost the generalization of RL agents through policy optimization guidance or prior knowledge. However, these approaches often lack seamless coordination between the RL agent and the foundation model, leading to unreasonable decision-making in unfamiliar environments and efficiency bottlenecks. Making full use of the inferential capabilities of foundation models and the rapid response capabilities of RL agents and enhancing the interaction between the two to form a dual system is still a lingering scientific question. To address this problem, we draw inspiration from Kahneman's theory of fast thinking (System 1) and slow thinking (System 2), demonstrating that balancing intuition and deep reasoning can achieve nimble decision-making in a complex world. In this study, we propose a Dual-System Adaptive Decision Framework (DSADF), integrating two complementary modules: System 1, comprising an RL agent and a memory space for fast and intuitive decision making, and System 2, driven by a VLM for deep and analytical reasoning. DSADF facilitates efficient and adaptive decision-making by combining the strengths of both systems. The empirical study in the video game environment: Crafter and Housekeep demonstrates the effectiveness of our proposed method, showing significant improvements in decision abilities for both unseen and known tasks.

摘要

尽管强化学习(RL)智能体在明确环境中表现优异,但其依赖试错交互的特性常导致策略难以泛化至动态场景。近期研究尝试通过大型语言模型(LLM)或视觉语言模型(VLM)提供策略优化指导或先验知识来提升RL智能体的泛化能力,但这些方法往往缺乏基础模型与RL智能体的有机协同,导致陌生环境中决策失当及效率瓶颈。如何充分发挥基础模型的推理能力与RL智能体的快速响应优势,通过增强两者交互构建双系统协同机制,仍是悬而未决的科学问题。受卡尼曼快思维(系统1)与慢思维(系统2)理论启发,本研究证明直觉与深度推理的平衡可实现复杂环境中的敏捷决策。我们提出双系统自适应决策框架(DSADF),整合两个互补模块:由RL智能体与记忆空间构成的快速直觉决策系统1,以及由VLM驱动的深度分析推理系统2。该框架通过双系统优势融合实现高效自适应决策。在电子游戏环境Crafter和Housekeep中的实证研究表明,该方法对未知任务和已知任务的决策能力均有显著提升。


Fusing Bidirectional Chains of Thought and Reward Mechanisms: A Method for Enhancing Question-Answering Capabilities of Large Language Models for Chinese Intangible Cultural Heritage

Abstract

arXiv:2505.08167v1 Announce Type: cross Abstract: The rapid development of large language models (LLMs) has provided significant support and opportunities for the advancement of domain-specific LLMs. However, fine-tuning these large models using Intangible Cultural Heritage (ICH) data inevitably faces challenges such as bias, incorrect knowledge inheritance, and catastrophic forgetting. To address these issues, we propose a novel training method that integrates bidirectional chains of thought with a reward mechanism. This method is built upon ICH-Qwen, a large language model specifically designed for the field of intangible cultural heritage. The proposed method not only enables the model to perform forward reasoning but also enhances the accuracy of the generated answers by utilizing reverse questioning and reverse reasoning to activate the model's latent knowledge. Additionally, a reward mechanism is introduced during training to optimize the decision-making process. This mechanism improves the quality of the model's outputs through structural and content evaluations with different weighting schemes. We conduct comparative experiments on ICH-Qwen, with results demonstrating that our method outperforms 0-shot, step-by-step reasoning, knowledge distillation, and question augmentation methods in terms of accuracy, Bleu-4, and Rouge-L scores on the question-answering task. Furthermore, the paper highlights the effectiveness of combining the bidirectional chains of thought and reward mechanism through ablation experiments. In addition, a series of generalizability experiments are conducted, with results showing that the proposed method yields improvements on various domain-specific datasets and advanced models in areas such as Finance, Wikidata, and StrategyQA. This demonstrates that the method is adaptable to multiple domains and provides a valuable approach for model training in future applications across diverse fields.

摘要

大型语言模型(LLMs)的快速发展为领域专用LLMs的进步提供了重要支持与机遇。然而,利用非物质文化遗产(ICH)数据对这些大模型进行微调时,不可避免地面临偏见、错误知识传承和灾难性遗忘等挑战。为解决这些问题,我们提出了一种融合双向思维链与奖励机制的新型训练方法。该方法基于专为非物质文化遗产领域设计的大语言模型ICH-Qwen构建,不仅使模型能够进行正向推理,还通过逆向提问与逆向推理激活模型的潜在知识,从而提升生成答案的准确性。此外,在训练过程中引入奖励机制以优化决策过程,该机制通过采用不同权重方案的结构化评估与内容评估来提升模型输出质量。我们在ICH-Qwen上开展对比实验,结果表明:在问答任务中,本方法在准确率、Bleu-4和Rouge-L分数上均优于零样本学习、逐步推理、知识蒸馏和问题增强方法。进一步地,通过消融实验验证了双向思维链与奖励机制结合的有效性。此外,我们进行了一系列泛化性实验,结果显示所提方法在金融、Wikidata和StrategyQA等多个领域专用数据集及先进模型上均取得性能提升,证明该方法具有跨领域适应性,为未来多领域应用中的模型训练提供了有价值的技术路径。


A Head to Predict and a Head to Question: Pre-trained Uncertainty Quantification Heads for Hallucination Detection in LLM Outputs

Abstract

arXiv:2505.08200v1 Announce Type: cross Abstract: Large Language Models (LLMs) have the tendency to hallucinate, i.e., to sporadically generate false or fabricated information. This presents a major challenge, as hallucinations often appear highly convincing and users generally lack the tools to detect them. Uncertainty quantification (UQ) provides a framework for assessing the reliability of model outputs, aiding in the identification of potential hallucinations. In this work, we introduce pre-trained UQ heads: supervised auxiliary modules for LLMs that substantially enhance their ability to capture uncertainty compared to unsupervised UQ methods. Their strong performance stems from the powerful Transformer architecture in their design and informative features derived from LLM attention maps. Experimental evaluation shows that these heads are highly robust and achieve state-of-the-art performance in claim-level hallucination detection across both in-domain and out-of-domain prompts. Moreover, these modules demonstrate strong generalization to languages they were not explicitly trained on. We pre-train a collection of UQ heads for popular LLM series, including Mistral, Llama, and Gemma 2. We publicly release both the code and the pre-trained heads.

摘要

大语言模型(LLMs)存在幻觉倾向,即偶发性生成虚假或捏造信息。这构成了重大挑战,因为幻觉通常极具说服力且用户普遍缺乏检测工具。不确定性量化(UQ)为评估模型输出的可靠性提供了框架,有助于识别潜在幻觉。本研究提出预训练UQ头部:针对LLMs的监督式辅助模块,相较于无监督UQ方法,其显著提升了模型捕捉不确定性的能力。其卓越性能源于设计中采用的强大Transformer架构以及从LLM注意力图中提取的信息化特征。实验评估表明,这些头部模块具有高度鲁棒性,在领域内和领域外提示的声明级幻觉检测中均达到最先进性能。此外,这些模块对未经显式训练的语言也展现出强大泛化能力。我们为包括Mistral、Llama和Gemma 2在内的主流LLM系列预训练了UQ头部集合,并公开了代码和预训练头部模型。


Enhancing Cache-Augmented Generation (CAG) with Adaptive Contextual Compression for Scalable Knowledge Integration

Abstract

arXiv:2505.08261v1 Announce Type: cross Abstract: The rapid progress in large language models (LLMs) has paved the way for novel approaches in knowledge-intensive tasks. Among these, Cache-Augmented Generation (CAG) has emerged as a promising alternative to Retrieval-Augmented Generation (RAG). CAG minimizes retrieval latency and simplifies system design by preloading knowledge into the model's context. However, challenges persist in scaling CAG to accommodate large and dynamic knowledge bases effectively. This paper introduces Adaptive Contextual Compression (ACC), an innovative technique designed to dynamically compress and manage context inputs, enabling efficient utilization of the extended memory capabilities of modern LLMs. To further address the limitations of standalone CAG, we propose a Hybrid CAG-RAG Framework, which integrates selective retrieval to augment preloaded contexts in scenarios requiring additional information. Comprehensive evaluations on diverse datasets highlight the proposed methods' ability to enhance scalability, optimize efficiency, and improve multi-hop reasoning performance, offering practical solutions for real-world knowledge integration challenges.

摘要

大型语言模型(LLMs)的快速发展为知识密集型任务提供了新的研究方法。其中,缓存增强生成(CAG)作为检索增强生成(RAG)的有力替代方案崭露头角。CAG通过将知识预加载至模型上下文,显著降低检索延迟并简化系统设计。然而,如何有效扩展CAG以适应大规模动态知识库仍存在挑战。本文提出自适应上下文压缩(ACC)技术,该创新方法能动态压缩和管理上下文输入,充分利用现代LLMs的扩展记忆能力。为进一步解决独立CAG的局限性,我们提出混合CAG-RAG框架,在需要补充信息的场景中整合选择性检索以增强预加载上下文。基于多数据集的综合评估表明,所提方法能有效提升可扩展性、优化效率并改进多跳推理性能,为现实世界知识整合挑战提供了实用解决方案。


Large Language Model Psychometrics: A Systematic Review of Evaluation, Validation, and Enhancement

Abstract

arXiv:2505.08245v1 Announce Type: cross Abstract: The rapid advancement of large language models (LLMs) has outpaced traditional evaluation methodologies. It presents novel challenges, such as measuring human-like psychological constructs, navigating beyond static and task-specific benchmarks, and establishing human-centered evaluation. These challenges intersect with Psychometrics, the science of quantifying the intangible aspects of human psychology, such as personality, values, and intelligence. This survey introduces and synthesizes an emerging interdisciplinary field of LLM Psychometrics, which leverages psychometric instruments, theories, and principles to evaluate, understand, and enhance LLMs. We systematically explore the role of Psychometrics in shaping benchmarking principles, broadening evaluation scopes, refining methodologies, validating results, and advancing LLM capabilities. This paper integrates diverse perspectives to provide a structured framework for researchers across disciplines, enabling a more comprehensive understanding of this nascent field. Ultimately, we aim to provide actionable insights for developing future evaluation paradigms that align with human-level AI and promote the advancement of human-centered AI systems for societal benefit. A curated repository of LLM psychometric resources is available at https://github.com/valuebyte-ai/Awesome-LLM-Psychometrics.

摘要

大型语言模型(LLMs)的快速发展已超越传统评估方法的适应能力,这带来了一系列新挑战:如何测量类人的心理构念、突破静态任务特定基准的限制,以及建立以人为中心的评估体系。这些挑战与心理测量学(一门量化人类心理无形特质的科学,如人格、价值观和智力)产生了学科交叉。本综述系统性地介绍并整合了新兴交叉学科'LLM心理测量学'的研究进展,该领域通过运用心理测量工具、理论和原则来评估、理解并提升LLMs。我们系统探讨了心理测量学在塑造基准测试原则、拓宽评估范围、优化方法论、验证结果以及增强LLM能力等方面的作用。本文整合多元视角,为跨学科研究者提供结构化框架,促进对这一新兴领域更全面的理解。最终,我们旨在为开发符合人类水平人工智能的未来评估范式提供可行见解,推动以人为本的AI系统发展,造福社会。精选的LLM心理测量学资源库详见https://github.com/valuebyte-ai/Awesome-LLM-Psychometrics。


A Practical Introduction to Deep Reinforcement Learning

Abstract

arXiv:2505.08295v1 Announce Type: cross Abstract: Deep reinforcement learning (DRL) has emerged as a powerful framework for solving sequential decision-making problems, achieving remarkable success in a wide range of applications, including game AI, autonomous driving, biomedicine, and large language models. However, the diversity of algorithms and the complexity of theoretical foundations often pose significant challenges for beginners seeking to enter the field. This tutorial aims to provide a concise, intuitive, and practical introduction to DRL, with a particular focus on the Proximal Policy Optimization (PPO) algorithm, which is one of the most widely used and effective DRL methods. To facilitate learning, we organize all algorithms under the Generalized Policy Iteration (GPI) framework, offering readers a unified and systematic perspective. Instead of lengthy theoretical proofs, we emphasize intuitive explanations, illustrative examples, and practical engineering techniques. This work serves as an efficient and accessible guide, helping readers rapidly progress from basic concepts to the implementation of advanced DRL algorithms.

摘要

深度强化学习(DRL)已成为解决序列决策问题的强大框架,在游戏AI、自动驾驶、生物医学和大型语言模型等广泛领域取得了显著成功。然而,算法的多样性和理论基础的复杂性往往给初学者进入该领域带来重大挑战。本教程旨在提供一份简洁、直观且实用的DRL入门指南,特别侧重于最广泛使用且高效的DRL方法之一——近端策略优化(PPO)算法。为便于学习,我们将所有算法统一纳入广义策略迭代(GPI)框架,为读者提供系统化的视角。相较于冗长的理论证明,我们更注重直观解释、示例说明和实用工程技巧。本工作作为一份高效且易于理解的指南,帮助读者从基础概念快速进阶至高级DRL算法的实现。
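
As a rough illustration of the algorithm the tutorial centers on, the standard PPO clipped surrogate objective can be sketched in a few lines. This is a minimal sketch of the widely published PPO-Clip objective, not code from the tutorial; the function name `ppo_clip_objective` and the default `clip_eps=0.2` are assumptions chosen for illustration.

```python
import math


def ppo_clip_objective(new_logps, old_logps, advantages, clip_eps=0.2):
    """Mean clipped surrogate objective over a batch (a quantity to maximize).

    new_logps / old_logps: log-probabilities of the taken actions under the
    current and the data-collecting policy. advantages: advantage estimates.
    All names and defaults here are illustrative assumptions.
    """
    total = 0.0
    for new_lp, old_lp, adv in zip(new_logps, old_logps, advantages):
        # Probability ratio between the current and the old policy.
        ratio = math.exp(new_lp - old_lp)
        # Keep the ratio inside [1 - eps, 1 + eps].
        clipped = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
        # Pessimistic choice: take the smaller of the two surrogate terms.
        total += min(ratio * adv, clipped * adv)
    return total / len(advantages)
```

With a positive advantage, the objective stops rewarding the policy once the ratio exceeds `1 + clip_eps`; with a negative advantage, the unclipped term is kept when it is the more pessimistic one.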


LLM Enhancers for GNNs: An Analysis from the Perspective of Causal Mechanism Identification

Abstract

arXiv:2505.08265v1 Announce Type: cross Abstract: The use of large language models (LLMs) as feature enhancers to optimize node representations, which are then used as inputs for graph neural networks (GNNs), has shown significant potential in graph representation learning. However, the fundamental properties of this approach remain underexplored. To address this gap, we propose a more in-depth analysis based on the interchange intervention method. First, we construct a synthetic graph dataset with controllable causal relationships, enabling precise manipulation of semantic relationships and causal modeling to provide data for analysis. Using this dataset, we conduct interchange interventions to examine the deeper properties of LLM enhancers and GNNs, uncovering their underlying logic and internal mechanisms. Building on the analytical results, we design a plug-and-play optimization module to improve the information transfer between LLM enhancers and GNNs. Experiments across multiple datasets and models validate the proposed module.

摘要

将大语言模型(LLMs)作为特征增强器来优化节点表示(随后作为图神经网络(GNNs)的输入)的方法,在图表示学习中展现出显著潜力。然而,该方法的根本特性仍未得到充分探索。为解决这一问题,我们提出基于交换干预方法对此议题进行更深入分析。首先,我们构建了一个具有可控因果关系的合成图数据集,通过精确操控语义关系与因果建模为分析提供数据基础。利用该数据集,我们实施交换干预以探究LLM增强器与GNNs的深层特性,揭示其底层逻辑与内部机制。基于分析结果,我们设计了一个即插即用的优化模块以改进LLM增强器与GNNs间的信息传递。跨多个数据集与模型的实验验证了所提模块的有效性。


Low-Complexity Inference in Continual Learning via Compressed Knowledge Transfer

Abstract

arXiv:2505.08327v1 Announce Type: cross Abstract: Continual learning (CL) aims to train models that can learn a sequence of tasks without forgetting previously acquired knowledge. A core challenge in CL is balancing stability -- preserving performance on old tasks -- and plasticity -- adapting to new ones. Recently, large pre-trained models have been widely adopted in CL for their ability to support both, offering strong generalization for new tasks and resilience against forgetting. However, their high computational cost at inference time limits their practicality in real-world applications, especially those requiring low latency or energy efficiency. To address this issue, we explore model compression techniques, including pruning and knowledge distillation (KD), and propose two efficient frameworks tailored for class-incremental learning (CIL), a challenging CL setting where task identities are unavailable during inference. The pruning-based framework includes pre- and post-pruning strategies that apply compression at different training stages. The KD-based framework adopts a teacher-student architecture, where a large pre-trained teacher transfers downstream-relevant knowledge to a compact student. Extensive experiments on multiple CIL benchmarks demonstrate that the proposed frameworks achieve a better trade-off between accuracy and inference complexity, consistently outperforming strong baselines. We further analyze the trade-offs between the two frameworks in terms of accuracy and efficiency, offering insights into their use across different scenarios.

摘要

持续学习(CL)旨在训练能够按顺序学习多个任务且不遗忘已获知识的模型。其核心挑战在于平衡稳定性(保持旧任务性能)与可塑性(适应新任务)。近年来,大型预训练模型因其兼具强大新任务泛化能力和抗遗忘特性,被广泛用于CL研究。然而,这些模型在推理时的高计算成本限制了其在实际应用(尤其是需要低延迟或高能效的场景)中的实用性。为解决该问题,我们探索了模型压缩技术(包括剪枝和知识蒸馏KD),并针对类别增量学习(CIL)这一任务身份在推理阶段不可知的挑战性CL场景,提出两种高效框架。基于剪枝的框架包含前剪枝与后剪枝策略,分别在不同训练阶段实施压缩;基于KD的框架采用师生架构,通过大型预训练教师模型向下游紧凑学生模型传递任务相关知识。在多个CIL基准测试上的实验表明,所提框架能实现精度与推理复杂度的更优权衡,性能持续超越强基线。我们进一步分析两种框架在精度和效率上的权衡关系,为不同场景下的应用提供参考依据。
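
The teacher-student transfer described above is commonly implemented with a temperature-softened KL term. The sketch below shows that generic distillation loss; the abstract does not give the paper's exact objective, so the function names, the temperature default, and the `T^2` scaling are assumptions taken from common practice.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over temperature-scaled logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2.

    This is the generic knowledge-distillation term, used here only as an
    illustration; it is not claimed to be the loss used in the paper.
    """
    p = softmax(teacher_logits, temperature)  # soft targets from the teacher
    q = softmax(student_logits, temperature)  # student predictions
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return temperature ** 2 * kl
```

The loss is zero when the student reproduces the teacher's logits and grows as the two softened distributions diverge.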


RepCali: High Efficient Fine-tuning Via Representation Calibration in Latent Space for Pre-trained Language Models

Abstract

arXiv:2505.08463v1 Announce Type: cross Abstract: Fine-tuning pre-trained language models (PLMs) has become a dominant paradigm in applying PLMs to downstream tasks. However, with limited fine-tuning, PLMs still struggle with the discrepancies between the representation obtained from the PLMs' encoder and the optimal input to the PLMs' decoder. This paper tackles this challenge by learning to calibrate the representation of PLMs in the latent space. In the proposed representation calibration method (RepCali), we integrate a specific calibration block to the latent space after the encoder and use the calibrated output as the decoder input. The merits of the proposed RepCali include its universality to all PLMs with encoder-decoder architectures, its plug-and-play nature, and ease of implementation. Extensive experiments on 25 PLM-based models across 8 tasks (including both English and Chinese datasets) demonstrate that the proposed RepCali offers desirable enhancements to PLMs (including LLMs) and significantly improves the performance of downstream tasks. Comparison experiments across 4 benchmark tasks indicate that RepCali is superior to the representative fine-tuning baselines.

摘要

微调预训练语言模型(PLMs)已成为将PLMs应用于下游任务的主要范式。然而,在有限微调条件下,PLMs仍难以克服编码器获取的表征与解码器最优输入之间的差异。本文通过潜在空间表征校准学习来解决这一挑战。在所提出的表征校准方法(RepCali)中,我们在编码器后的潜在空间集成特定校准模块,并将校准输出作为解码器输入。RepCali的优势包括:适用于所有编码器-解码器架构的PLMs、即插即用特性以及易于实现。基于8个任务(含中英文数据集)对25个PLM模型的大规模实验表明,RepCali能有效增强PLMs(包括大语言模型)性能,并显著提升下游任务表现。在4个基准任务上的对比实验证明,RepCali优于代表性微调基线方法。


Accelerating Chain-of-Thought Reasoning: When Goal-Gradient Importance Meets Dynamic Skipping

Abstract

arXiv:2505.08392v1 Announce Type: cross Abstract: Large Language Models leverage Chain-of-Thought (CoT) prompting for complex tasks, but their reasoning traces are often excessively verbose and inefficient, leading to significant computational costs and latency. Current CoT compression techniques typically rely on generic importance metrics and static compression rates, which may inadvertently remove functionally critical tokens or fail to adapt to varying reasoning complexity. To overcome these limitations, we propose Adaptive GoGI-Skip, a novel framework learning dynamic CoT compression via supervised fine-tuning. This approach introduces two synergistic innovations: (1) Goal-Gradient Importance (GoGI), a novel metric accurately identifying functionally relevant tokens by measuring the gradient influence of their intermediate representations on the final answer loss, and (2) Adaptive Dynamic Skipping (ADS), a mechanism dynamically regulating the compression rate based on runtime model uncertainty while ensuring local coherence through an adaptive N-token constraint. To our knowledge, this is the first work unifying a goal-oriented, gradient-based importance metric with dynamic, uncertainty-aware skipping for CoT compression. Trained on compressed MATH data, Adaptive GoGI-Skip demonstrates strong cross-domain generalization across diverse reasoning benchmarks including AIME, GPQA, and GSM8K. It achieves substantial efficiency gains - reducing CoT token counts by over 45% on average and delivering 1.6-2.0 times inference speedups - while maintaining high reasoning accuracy. Notably, it significantly outperforms existing baselines by preserving accuracy even at high effective compression rates, advancing the state of the art in the CoT reasoning efficiency-accuracy trade-off.

摘要

大型语言模型利用思维链(CoT)提示处理复杂任务,但其推理轨迹往往过于冗长低效,导致显著的计算成本与延迟。现有CoT压缩技术通常依赖通用重要性度量和静态压缩率,可能误删功能关键标记或无法适应多变的推理复杂度。为突破这些限制,我们提出自适应GoGI-Skip框架,通过监督微调实现动态CoT压缩。该方法包含两项协同创新:(1)目标梯度重要性(GoGI)——通过测量标记中间表征对最终答案损失的梯度影响,精准识别功能相关标记的新指标;(2)自适应动态跳过(ADS)——基于运行时模型不确定性动态调节压缩率,同时通过自适应N标记约束确保局部连贯性的机制。据我们所知,这是首个将面向目标的梯度重要性度量与动态不确定性感知跳过相统一的CoT压缩研究。在压缩MATH数据上训练的Adaptive GoGI-Skip,在AIME、GPQA和GSM8K等多样化推理基准中展现出强大的跨领域泛化能力,平均减少45%以上的CoT标记数量并实现1.6-2.0倍推理加速,同时保持高推理准确率。值得注意的是,该框架在高有效压缩率下仍能保持精度,显著优于现有基线方法,推动了CoT推理效率与精度权衡的技术前沿。


LCES: Zero-shot Automated Essay Scoring via Pairwise Comparisons Using Large Language Models

Abstract

arXiv:2505.08498v1 Announce Type: cross Abstract: Recent advances in large language models (LLMs) have enabled zero-shot automated essay scoring (AES), providing a promising way to reduce the cost and effort of essay scoring in comparison with manual grading. However, most existing zero-shot approaches rely on LLMs to directly generate absolute scores, which often diverge from human evaluations owing to model biases and inconsistent scoring. To address these limitations, we propose LLM-based Comparative Essay Scoring (LCES), a method that formulates AES as a pairwise comparison task. Specifically, we instruct LLMs to judge which of two essays is better, collect many such comparisons, and convert them into continuous scores. Considering that the number of possible comparisons grows quadratically with the number of essays, we improve scalability by employing RankNet to efficiently transform LLM preferences into scalar scores. Experiments using AES benchmark datasets show that LCES outperforms conventional zero-shot methods in accuracy while maintaining computational efficiency. Moreover, LCES is robust across different LLM backbones, highlighting its applicability to real-world zero-shot AES.

摘要

大语言模型(LLM)的最新进展实现了零样本自动作文评分(AES),与人工评分相比,为降低评分成本与工作量提供了可行方案。然而,现有零样本方法大多依赖LLM直接生成绝对分数,由于模型偏见和评分不一致性,其结果常与人工评估存在偏差。为解决这些局限,我们提出基于LLM的比较式作文评分方法(LCES),将AES构建为成对比较任务。具体而言,我们指导LLM判断两篇作文的优劣,收集大量此类比较结果,并将其转化为连续分数。考虑到比较次数随作文数量呈平方级增长,我们采用RankNet高效地将LLM偏好转换为标量分数,从而提升方法的可扩展性。基于AES基准数据集的实验表明,LCES在保持计算效率的同时,其准确性优于传统零样本方法。此外,LCES在不同LLM骨干模型中均表现稳健,凸显了其在现实零样本AES场景中的适用性。
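
To make the step from pairwise judgments to continuous scores concrete, here is a simplified sketch. It applies the RankNet pairwise logistic loss to one free scalar per essay, whereas a real RankNet would score essays from their features; that simplification, the function name `fit_scores`, and the hyperparameters are assumptions for illustration, not the paper's implementation.

```python
import math


def fit_scores(num_essays, comparisons, lr=0.1, epochs=200):
    """Fit one scalar score per essay from (winner, loser) index pairs.

    Uses the RankNet-style loss -log(sigmoid(s_winner - s_loser)) with plain
    stochastic gradient descent. Hypothetical helper for illustration only.
    """
    scores = [0.0] * num_essays
    for _ in range(epochs):
        for winner, loser in comparisons:
            # Modeled probability that the winner is the better essay.
            p = 1.0 / (1.0 + math.exp(-(scores[winner] - scores[loser])))
            # Gradient step on -log(p): push the two scores apart by (1 - p).
            step = lr * (1.0 - p)
            scores[winner] += step
            scores[loser] -= step
    return scores


# Essay 0 beats 1 and 2; essay 1 beats 2.
scores = fit_scores(3, [(0, 1), (1, 2), (0, 2)])
```

The fitted scores are only defined up to a shift, so in practice they would be rescaled to the target scoring range.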


Optimizing Retrieval-Augmented Generation: Analysis of Hyperparameter Impact on Performance and Efficiency

Abstract

arXiv:2505.08445v1 Announce Type: cross Abstract: Large language models achieve high task performance yet often hallucinate or rely on outdated knowledge. Retrieval-augmented generation (RAG) addresses these gaps by coupling generation with external search. We analyse how hyperparameters influence speed and quality in RAG systems, covering Chroma and Faiss vector stores, chunking policies, cross-encoder re-ranking, and temperature, and we evaluate six metrics: faithfulness, answer correctness, answer relevancy, context precision, context recall, and answer similarity. Chroma processes queries 13% faster, whereas Faiss yields higher retrieval precision, revealing a clear speed-accuracy trade-off. Naive fixed-length chunking with small windows and minimal overlap outperforms semantic segmentation while remaining the quickest option. Re-ranking provides modest gains in retrieval quality yet increases runtime by roughly a factor of 5, so its usefulness depends on latency constraints. These results help practitioners balance computational cost and accuracy when tuning RAG systems for transparent, up-to-date responses. Finally, we re-evaluate the top configurations with a corrective RAG workflow and show that their advantages persist when the model can iteratively request additional evidence. We obtain a near-perfect context precision (99%), which demonstrates that RAG systems can achieve extremely high retrieval accuracy with the right combination of hyperparameters, with significant implications for applications where retrieval quality directly impacts downstream task performance, such as clinical decision support in healthcare.

摘要

大型语言模型虽能实现较高的任务完成度,却常出现幻觉或依赖过时知识。检索增强生成(RAG)通过将生成过程与外部检索相结合来解决这些问题。我们系统分析了超参数对RAG系统速度与质量的影响,涵盖Chroma和Faiss向量数据库、分块策略、交叉编码器重排序及温度参数,并评估了六项指标:忠实度、答案正确性、答案相关性、上下文精确率、上下文召回率和答案相似度。Chroma的查询处理速度快13%,而Faiss具有更高的检索精度,揭示出明显的速度-准确率权衡关系。采用小窗口最小重叠的固定长度分块策略在保持最快速度的同时,表现优于语义分割方法。重排序虽能小幅提升检索质量,但会使运行时间增加约5倍,其实用性取决于延迟限制。这些发现可帮助开发者在调整RAG系统时平衡计算成本与准确性,以获得透明且最新的响应。最后,我们通过纠正性RAG工作流对最优配置进行复评,证明当模型能迭代请求额外证据时,其优势依然存在。我们实现了接近完美的上下文精确率(99%),这表明通过超参数的恰当组合,RAG系统能够实现极高的检索准确率,这对检索质量直接影响下游任务性能的应用(如医疗领域的临床决策支持系统)具有重大意义。
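
The naive fixed-length chunking policy mentioned above can be sketched as a sliding window with a small overlap. The function name `chunk_fixed` and the window and overlap values are assumptions for illustration; the abstract does not state the sizes the authors used.

```python
def chunk_fixed(tokens, window=4, overlap=1):
    """Split a token list into fixed-length chunks that share `overlap` tokens.

    The last chunk may be shorter than `window`. Illustrative helper only.
    """
    if not 0 <= overlap < window:
        raise ValueError("overlap must satisfy 0 <= overlap < window")
    stride = window - overlap
    chunks = []
    for start in range(0, len(tokens), stride):
        chunks.append(tokens[start:start + window])
        # Stop once the current window has reached the end of the input.
        if start + window >= len(tokens):
            break
    return chunks


chunks = chunk_fixed(list(range(10)), window=4, overlap=1)
# chunks == [[0, 1, 2, 3], [3, 4, 5, 6], [6, 7, 8, 9]]
```

Smaller windows and less overlap produce more, shorter chunks, which is the regime the abstract reports as both fastest and most accurate.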


Small but Significant: On the Promise of Small Language Models for Accessible AIED

Abstract

arXiv:2505.08588v1 Announce Type: cross Abstract: GPT has become nearly synonymous with large language models (LLMs), an increasingly popular term in AIED proceedings. A simple keyword-based search reveals that 61% of the 76 long and short papers presented at AIED 2024 describe novel solutions using LLMs to address some of the long-standing challenges in education, and 43% specifically mention GPT. Although LLMs pioneered by GPT create exciting opportunities to strengthen the impact of AI on education, we argue that the field's predominant focus on GPT and other resource-intensive LLMs (with more than 10B parameters) risks neglecting the potential impact that small language models (SLMs) can make in providing resource-constrained institutions with equitable and affordable access to high-quality AI tools. Supported by positive results on knowledge component (KC) discovery, a critical challenge in AIED, we demonstrate that SLMs such as Phi-2 can produce an effective solution without elaborate prompting strategies. Hence, we call for more attention to developing SLM-based AIED approaches.

摘要

GPT已成为大型语言模型(LLM)的代名词,这一术语在人工智能教育(AIED)领域的会议论文集中日益流行。基于关键词的简单检索显示,在AIED 2024年发表的76篇长文和短文中,61%的论文描述了利用LLM解决教育领域长期挑战的新方案,其中43%明确提及GPT。尽管以GPT为代表的LLM为增强人工智能对教育的影响创造了令人振奋的机遇,但我们认为,该领域对GPT及其他资源密集型LLM(参数超过100亿)的过度关注,可能导致忽视小型语言模型(SLM)的潜在价值——它们能为资源有限的机构提供公平且可负担的高质量人工智能工具。通过在AIED关键挑战'知识组件(KC)发现'上取得的积极成果,我们证明如Phi-2等SLM无需复杂提示策略即可生成有效解决方案。因此,我们呼吁更多研究者关注基于SLM的AIED方法开发。


The Truth Becomes Clearer Through Debate! Multi-Agent Systems with Large Language Models Unmask Fake News

Abstract

arXiv:2505.08532v1 Announce Type: cross Abstract: In today's digital environment, the rapid propagation of fake news via social networks poses significant social challenges. Most existing detection methods either employ traditional classification models, which suffer from low interpretability and limited generalization capabilities, or craft specific prompts for large language models (LLMs) to produce explanations and results directly, failing to leverage LLMs' reasoning abilities fully. Inspired by the saying that "truth becomes clearer through debate," our study introduces a novel multi-agent system with LLMs named TruEDebate (TED) to enhance the interpretability and effectiveness of fake news detection. TED employs a rigorous debate process inspired by formal debate settings. Central to our approach are two innovative components: the DebateFlow Agents and the InsightFlow Agents. The DebateFlow Agents organize agents into two teams, where one supports and the other challenges the truth of the news. These agents engage in opening statements, cross-examination, rebuttal, and closing statements, simulating a rigorous debate process akin to human discourse analysis, allowing for a thorough evaluation of news content. Concurrently, the InsightFlow Agents consist of two specialized sub-agents: the Synthesis Agent and the Analysis Agent. The Synthesis Agent summarizes the debates and provides an overarching viewpoint, ensuring a coherent and comprehensive evaluation. The Analysis Agent, which includes a role-aware encoder and a debate graph, integrates role embeddings and models the interactions between debate roles and arguments using an attention mechanism, providing the final judgment.

摘要

在当前数字环境中,虚假新闻通过社交网络的快速传播带来了重大社会挑战。现有检测方法大多采用传统分类模型(其可解释性差且泛化能力有限),或为大型语言模型(LLM)设计特定提示来直接生成解释和结果,未能充分利用LLM的推理能力。受"真理越辩越明"启发,本研究提出名为TruEDebate(TED)的新型多智能体系统,通过LLM增强虚假新闻检测的可解释性与有效性。TED采用受正式辩论场景启发的严谨辩论流程,其核心包含两个创新组件:辩论流智能体(DebateFlow Agents)和洞察流智能体(InsightFlow Agents)。辩论流智能体将智能体分为支持新闻真实性的正方团队与提出质疑的反方团队,通过开场陈述、交叉质询、反驳和总结陈述等环节模拟类人话语分析的严谨辩论过程,实现对新闻内容的全面评估。与此同时,洞察流智能体包含两个专业子智能体:综合智能体(Synthesis Agent)负责总结辩论过程并提供全局观点,确保评估的连贯性与全面性;分析智能体(Analysis Agent)通过角色感知编码器和辩论图,集成角色嵌入并采用注意力机制建模辩论角色与论点间的交互关系,最终形成判定结论。


A Social Robot with Inner Speech for Dietary Guidance

Abstract

arXiv:2505.08664v1 Announce Type: cross Abstract: We explore the use of inner speech as a mechanism to enhance transparency and trust in social robots for dietary advice. In humans, inner speech structures thought processes and decision-making; in robotics, it improves explainability by making reasoning explicit. This is crucial in healthcare scenarios, where trust in robotic assistants depends on both accurate recommendations and human-like dialogue, which make interactions more natural and engaging. Building on this, we developed a social robot that provides dietary advice, and we provided the architecture with inner speech capabilities to validate user input, refine reasoning, and generate clear justifications. The system integrates large language models for natural language understanding and a knowledge graph for structured dietary information. By making decisions more transparent, our approach strengthens trust and improves human-robot interaction in healthcare. We validated this by measuring the computational efficiency of our architecture and conducting a small user study, which assessed the reliability of inner speech in explaining the robot's behavior.

摘要

我们探索利用内部语言作为增强社交机器人在饮食建议方面透明度与信任的机制。在人类认知中,内部语言能结构化思维过程与决策制定;在机器人领域,它通过显性化推理过程提升可解释性。这在医疗健康场景中尤为关键,因为用户对机器人助手的信任既取决于建议的准确性,也依赖于拟人化对话带来的自然交互体验。基于此,我们开发了具备内部语言功能的饮食建议社交机器人系统架构,该架构能验证用户输入、优化推理过程并生成清晰决策依据。系统整合了大型语言模型实现自然语言理解,并采用知识图谱构建结构化饮食信息。通过提升决策透明度,我们的方法增强了医疗场景中人机交互的信任度。我们通过架构计算效率的量化评估和小规模用户研究验证了该方案,其中用户研究重点评估了内部语言在解释机器人行为方面的可靠性。


CodePDE: An Inference Framework for LLM-driven PDE Solver Generation

Abstract

arXiv:2505.08783v1 Announce Type: cross Abstract: Partial differential equations (PDEs) are fundamental to modeling physical systems, yet solving them remains a complex challenge. Traditional numerical solvers rely on expert knowledge to implement and are computationally expensive, while neural-network-based solvers require large training datasets and often lack interpretability. In this work, we frame PDE solving as a code generation task and introduce CodePDE, the first inference framework for generating PDE solvers using large language models (LLMs). Leveraging advanced inference-time algorithms and scaling strategies, CodePDE unlocks critical capacities of LLMs for PDE solving: reasoning, debugging, self-refinement, and test-time scaling -- all without task-specific tuning. CodePDE achieves superhuman performance across a range of representative PDE problems. We also present a systematic empirical analysis of LLM-generated solvers, analyzing their accuracy, efficiency, and numerical scheme choices. Our findings highlight the promise and the current limitations of LLMs in PDE solving, offering a new perspective on solver design and opportunities for future model development. Our code is available at https://github.com/LithiumDA/CodePDE.

摘要

偏微分方程(PDEs)是物理系统建模的基础,但其求解仍面临复杂挑战。传统数值求解器依赖专家知识实现且计算成本高昂,而基于神经网络的求解器需要大量训练数据且往往缺乏可解释性。本研究将PDE求解构建为代码生成任务,提出首个利用大语言模型(LLMs)生成PDE求解器的推理框架CodePDE。通过先进推理算法与规模扩展策略,CodePDE解锁了LLM在PDE求解中的关键能力:推理、调试、自我优化和测试时扩展——均无需任务特定调参。CodePDE在一系列典型PDE问题上实现了超人类表现。我们系统分析了LLM生成求解器的精度、效率及数值格式选择,研究结果揭示了LLMs在PDE求解中的潜力与当前局限,为求解器设计和未来模型发展提供了新视角。代码开源地址:https://github.com/LithiumDA/CodePDE。


Memorization-Compression Cycles Improve Generalization

Abstract

arXiv:2505.08727v1 Announce Type: cross Abstract: We prove theoretically that generalization improves not only through data scaling but also by compressing internal representations. To operationalize this insight, we introduce the Information Bottleneck Language Modeling (IBLM) objective, which reframes language modeling as a constrained optimization problem: minimizing representation entropy subject to optimal prediction performance. Empirically, we observe an emergent memorization-compression cycle during LLM pretraining, evidenced by oscillating positive/negative gradient alignment between cross-entropy and Matrix-Based Entropy (MBE), a measure of representation entropy. This pattern closely mirrors the predictive-compressive trade-off prescribed by IBLM and also parallels the biological alternation between awake learning and sleep consolidation. Motivated by this observation, we propose Gated Phase Transition (GAPT), a training algorithm that adaptively switches between memorization and compression phases. When applied to GPT-2 pretraining on the FineWeb dataset, GAPT reduces MBE by 50% and improves cross-entropy by 4.8%. GAPT improves OOD generalization by 35% in a pretraining task on arithmetic multiplication. In a setting designed to simulate catastrophic forgetting, GAPT reduces interference by compressing and separating representations, achieving a 97% improvement in separation - paralleling the functional role of sleep consolidation.

摘要

我们通过理论证明,泛化能力的提升不仅依赖于数据规模的扩大,还源于内部表征的压缩。为实现这一洞见,我们提出信息瓶颈语言建模(IBLM)目标,将语言建模重构为约束优化问题:在保持最优预测性能的前提下最小化表征熵。实证研究发现,大语言模型预训练过程中会出现记忆-压缩的周期性现象,表现为交叉熵与矩阵基熵(MBE,表征熵的度量指标)梯度对齐方向的正负振荡。这种模式与IBLM理论预测的预测-压缩权衡高度吻合,也与生物学习中清醒状态学习与睡眠巩固的交替机制相似。基于此发现,我们提出门控相位转换(GAPT)训练算法,该算法能自适应地在记忆阶段与压缩阶段之间切换。在FineWeb数据集上对GPT-2进行预训练时,GAPT使MBE降低50%,同时将交叉熵提升4.8%。在算术乘法预训练任务中,GAPT使分布外泛化能力提升35%。在模拟灾难性遗忘的场景中,GAPT通过压缩和分离表征来减少干扰,实现97%的分离度提升——这与睡眠巩固的功能机制相呼应。


Securing RAG: A Risk Assessment and Mitigation Framework

Abstract

arXiv:2505.08728v1 Announce Type: cross Abstract: Retrieval Augmented Generation (RAG) has emerged as the de facto industry standard for user-facing NLP applications, offering the ability to integrate data without re-training or fine-tuning Large Language Models (LLMs). This capability enhances the quality and accuracy of responses but also introduces novel security and privacy challenges, particularly when sensitive data is integrated. With the rapid adoption of RAG, securing data and services has become a critical priority. This paper first reviews the vulnerabilities of RAG pipelines and outlines the attack surface from data pre-processing and data storage management to integration with LLMs. The identified risks are then paired with corresponding mitigations in a structured overview. In a second step, the paper develops a framework that combines RAG-specific security considerations with existing general security guidelines, industry standards, and best practices. The proposed framework aims to guide the implementation of robust, compliant, secure, and trustworthy RAG systems.

摘要

检索增强生成(RAG)已成为面向用户的自然语言处理应用的事实行业标准,其能够在无需重新训练或微调大语言模型(LLM)的情况下集成数据。这一能力不仅提升了响应的质量与准确性,同时也带来了新的安全与隐私挑战,尤其是在集成敏感数据时。随着RAG的快速普及,保障数据与服务安全已成为关键要务。本文首先梳理了RAG流程中的潜在漏洞,系统阐述了从数据预处理、数据存储管理到与LLM集成的全链路攻击面。随后通过结构化综述将已识别的风险与对应缓解措施进行匹配。其次,本文构建了一个融合RAG特有安全考量、现有通用安全准则、行业标准及最佳实践的框架。该框架旨在指导开发健壮、合规、安全且可信的RAG系统。


PWC-MoE: Privacy-Aware Wireless Collaborative Mixture of Experts

Abstract

arXiv:2505.08719v1 Announce Type: cross Abstract: Large language models (LLMs) hosted on cloud servers alleviate the computational and storage burdens on local devices but raise privacy concerns due to sensitive data transmission and require substantial communication bandwidth, which is challenging in constrained environments. In contrast, small language models (SLMs) running locally enhance privacy but suffer from limited performance on complex tasks. To balance computational cost, performance, and privacy protection under bandwidth constraints, we propose a privacy-aware wireless collaborative mixture of experts (PWC-MoE) framework. Specifically, PWC-MoE employs a sparse privacy-aware gating network to dynamically route sensitive tokens to privacy experts located on local clients, while non-sensitive tokens are routed to non-privacy experts located at the remote base station. To achieve computational efficiency, the gating network ensures that each token is dynamically routed to and processed by only one expert. To enhance scalability and prevent overloading of specific experts, we introduce a group-wise load-balancing mechanism for the gating network that evenly distributes sensitive tokens among privacy experts and non-sensitive tokens among non-privacy experts. To adapt to bandwidth constraints while preserving model performance, we propose a bandwidth-adaptive and importance-aware token offloading scheme. This scheme incorporates an importance predictor to evaluate the importance scores of non-sensitive tokens, prioritizing the most important tokens for transmission to the base station based on their predicted importance and the available bandwidth. Experiments demonstrate that the PWC-MoE framework effectively preserves privacy and maintains high performance even in bandwidth-constrained environments, offering a practical solution for deploying LLMs in privacy-sensitive and bandwidth-limited scenarios.

摘要

基于云服务器部署的大语言模型(LLMs)虽然减轻了本地设备的计算和存储负担,但由于敏感数据传输会引发隐私问题,且需要大量通信带宽,在资源受限环境中面临挑战。相比之下,本地运行的小语言模型(SLMs)虽能增强隐私保护,但在复杂任务上性能有限。为在带宽受限条件下平衡计算成本、性能表现与隐私保护,我们提出一种隐私感知的无线协作专家混合框架(PWC-MoE)。该框架采用稀疏隐私感知门控网络,动态将敏感标记路由至本地客户端的隐私专家模块,而非敏感标记则路由至远程基站的非隐私专家模块。为实现计算高效性,门控网络确保每个标记仅被动态路由至单一专家处理。为提升可扩展性并防止特定专家过载,我们提出分组负载均衡机制,使敏感标记均匀分布于隐私专家之间,非敏感标记均匀分布于非隐私专家之间。为适应带宽限制同时保持模型性能,我们设计带宽自适应且重要性感知的标记卸载方案:通过重要性预测器评估非敏感标记的重要性分数,根据预测重要性和可用带宽优先传输最关键标记。实验表明,PWC-MoE框架在带宽受限环境中能有效保护隐私并保持高性能,为隐私敏感和带宽受限场景下的LLMs部署提供了实用解决方案。
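
The routing and offloading logic described above can be sketched as a simple planning function: sensitive tokens stay on the local client, and non-sensitive tokens are sent to the base station in order of predicted importance until the bandwidth budget is used up. The token dictionary layout, the function name `plan_offloading`, and the choice to merely report tokens that do not fit are assumptions; the abstract does not say how the framework handles them.

```python
def plan_offloading(tokens, bandwidth_budget):
    """Decide where each token is processed under a bandwidth budget.

    tokens: list of dicts with keys 'id', 'sensitive' (bool) and
            'importance' (float, e.g. from an importance predictor).
    bandwidth_budget: maximum number of tokens that may be transmitted.
    Returns (local_ids, offloaded_ids, skipped_ids). Hypothetical layout.
    """
    budget = max(0, bandwidth_budget)
    # Sensitive tokens never leave the local client.
    local = [t["id"] for t in tokens if t["sensitive"]]
    # Rank non-sensitive tokens by predicted importance, highest first.
    candidates = sorted(
        (t for t in tokens if not t["sensitive"]),
        key=lambda t: t["importance"],
        reverse=True,
    )
    offloaded = [t["id"] for t in candidates[:budget]]
    skipped = [t["id"] for t in candidates[budget:]]
    return local, offloaded, skipped
```

A real gating network would produce the sensitivity labels and expert assignments jointly; here they are taken as given inputs to keep the sketch short.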


TradExpert: Revolutionizing Trading with Mixture of Expert LLMs

Abstract

arXiv:2411.00782v2 Announce Type: replace Abstract: The integration of Artificial Intelligence (AI) in the financial domain has opened new avenues for quantitative trading, particularly through the use of Large Language Models (LLMs). However, the challenge of effectively synthesizing insights from diverse data sources and integrating both structured and unstructured data persists. This paper presents TradExpert, a novel framework that employs a Mixture of Experts (MoE) approach, using four specialized LLMs, each analyzing distinct sources of financial data, including news articles, market data, alpha factors, and fundamental data. The insights of these expert LLMs are further synthesized by a General Expert LLM to make a final prediction or decision. With specific prompts, TradExpert can be switched between the prediction mode and the ranking mode for stock movement prediction and quantitative stock trading, respectively. In addition to existing benchmarks, we also release a large-scale financial dataset to comprehensively evaluate TradExpert's effectiveness. Our experimental results demonstrate TradExpert's superior performance across all trading scenarios.

摘要

人工智能(AI)在金融领域的应用为量化交易开辟了新途径,尤其是通过大语言模型(LLMs)的使用。然而,如何有效整合多源数据洞察并协同处理结构化与非结构化数据仍存在挑战。本文提出TradeExpert框架,采用混合专家(MoE)方法,通过四个专用LLMs分别分析新闻文章、市场数据、阿尔法因子和基本面数据等不同金融数据源。这些专家LLMs的洞察由通用专家LLM进一步综合,以生成最终预测或决策。通过特定指令,TradeExpert可在股票走势预测的预测模式与量化股票交易的排序模式间切换。除现有基准测试外,我们还发布了一个大规模金融数据集以全面评估TradeExpert的有效性。实验结果表明,TradeExpert在所有交易场景中均表现出卓越性能。


MobA: Multifaceted Memory-Enhanced Adaptive Planning for Efficient Mobile Task Automation

Abstract

arXiv:2410.13757v3 Announce Type: replace Abstract: Existing Multimodal Large Language Model (MLLM)-based agents face significant challenges in handling complex GUI (Graphical User Interface) interactions on devices. These challenges arise from the dynamic and structured nature of GUI environments, which integrate text, images, and spatial relationships, as well as the variability in action spaces across different pages and tasks. To address these limitations, we propose MobA, a novel MLLM-based mobile assistant system. MobA introduces an adaptive planning module that incorporates a reflection mechanism for error recovery and dynamically adjusts plans to align with the real environment contexts and action module's execution capacity. Additionally, a multifaceted memory module provides comprehensive memory support to enhance adaptability and efficiency. We also present MobBench, a dataset designed for complex mobile interactions. Experimental results on MobBench and AndroidArena demonstrate MobA's ability to handle dynamic GUI environments and perform complex mobile tasks.

摘要

现有基于多模态大语言模型(MLLM)的智能体在处理设备复杂图形用户界面(GUI)交互时面临显著挑战。这些挑战源于GUI环境动态结构化特性(整合文本、图像与空间关系)以及不同页面任务间动作空间的差异性。为此,我们提出MobA——一种新型基于MLLM的移动助手系统。MobA引入自适应规划模块,该模块集成错误恢复的反射机制,并能根据真实环境上下文与动作模块执行能力动态调整计划。此外,多维记忆模块提供全面记忆支持以增强适应性与效率。我们还构建了面向复杂移动交互的数据集MobBench。在MobBench和AndroidArena上的实验结果表明,MobA能够有效处理动态GUI环境并完成复杂移动任务。


Training Ultra Long Context Language Model with Fully Pipelined Distributed Transformer

Abstract

arXiv:2408.16978v2 Announce Type: replace Abstract: Large Language Models (LLMs) with long context capabilities are integral to complex tasks in natural language processing and computational biology, such as text generation and protein sequence analysis. However, training LLMs directly on extremely long contexts demands considerable GPU resources and increased memory, leading to higher costs and greater complexity. Alternative approaches that introduce long context capabilities via downstream finetuning or adaptations impose significant design limitations. In this paper, we propose the Fully Pipelined Distributed Transformer (FPDT) for efficiently training long-context LLMs with extreme hardware efficiency. For GPT and Llama models, we achieve a 16x increase in the sequence length that can be trained on the same hardware compared to current state-of-the-art solutions. With our dedicated sequence chunk pipeline design, we can now train an 8B LLM with a sequence length of 2 million on only 4 GPUs, while also maintaining over 55% of MFU. Our proposed FPDT is agnostic to existing training techniques and is proven to work efficiently across different LLM models.

摘要

具有长上下文处理能力的大型语言模型(LLM)在自然语言处理和计算生物学领域的复杂任务(如文本生成和蛋白质序列分析)中至关重要。然而,直接在超长上下文上训练LLM需要大量GPU资源和更高内存,导致成本上升和复杂度增加。通过下游微调或适配引入长上下文能力的替代方案则存在显著的设计局限性。本文提出全流水线分布式Transformer(FPDT),能以极高的硬件效率实现长上下文LLM的高效训练。对于GPT和Llama模型,我们在相同硬件条件下实现了比当前最优方案16倍的序列长度训练能力。通过专有的序列分块流水线设计,我们仅需4块GPU即可训练具有200万序列长度的80亿参数LLM,同时保持超过55%的模型浮点运算利用率。所提出的FPDT方法与现有训练技术无关,经证实可高效适用于不同LLM模型。


The Odyssey of the Fittest: Can Agents Survive and Still Be Good?

Abstract

arXiv:2502.05442v2 Announce Type: replace Abstract: As AI models grow in power and generality, understanding how agents learn and make decisions in complex environments is critical to promoting ethical behavior. This study introduces the Odyssey, a lightweight, adaptive text-based adventure game, providing a scalable framework for exploring AI ethics and safety. The Odyssey examines the ethical implications of implementing biological drives, specifically self-preservation, into three different agents: a Bayesian agent optimized with NEAT, a Bayesian agent optimized with stochastic variational inference, and a GPT-4o agent. The agents select actions at each scenario to survive, adapting to increasingly challenging scenarios. Post-simulation analysis evaluates the ethical scores of the agents' decisions, uncovering the tradeoffs they navigate to survive. Specifically, the analysis finds that when danger increases, agents' ethical behavior becomes unpredictable. Surprisingly, the GPT-4o agent outperformed the Bayesian models in both survival and ethical consistency, challenging assumptions about traditional probabilistic methods and raising a new challenge to understand the mechanisms of LLMs' probabilistic reasoning.

摘要

随着人工智能模型在能力和通用性上的不断提升,理解智能体如何在复杂环境中学习与决策对促进伦理行为至关重要。本研究推出"奥德赛"——一个轻量级、自适应的文本冒险游戏框架,为探索AI伦理与安全提供了可扩展的研究平台。该实验通过将生物驱动力(特别是自我保存本能)植入三种不同智能体(基于NEAT算法优化的贝叶斯智能体、基于随机变分推理优化的贝叶斯智能体以及GPT-4o智能体),系统考察其伦理影响。这些智能体需在逐步升级的挑战场景中选择生存策略。仿真后的伦理评分分析揭示了智能体为生存所做出的伦理权衡。研究发现:当危险程度加剧时,智能体的伦理行为会呈现不可预测性。值得注意的是,GPT-4o智能体在生存率和伦理一致性上均优于贝叶斯模型,这一发现对传统概率方法的假设提出了挑战,并为理解大语言模型的概率推理机制提出了新的研究课题。


Towards Autonomous UAV Visual Object Search in City Space: Benchmark and Agentic Methodology

Abstract

arXiv:2505.08765v1 Announce Type: cross Abstract: Aerial Visual Object Search (AVOS) tasks in urban environments require Unmanned Aerial Vehicles (UAVs) to autonomously search for and identify target objects using visual and textual cues without external guidance. Existing approaches struggle in complex urban environments due to redundant semantic processing, similar object distinction, and the exploration-exploitation dilemma. To bridge this gap and support the AVOS task, we introduce CityAVOS, the first benchmark dataset for autonomous search of common urban objects. This dataset comprises 2,420 tasks across six object categories with varying difficulty levels, enabling comprehensive evaluation of UAV agents' search capabilities. To solve the AVOS tasks, we also propose PRPSearcher (Perception-Reasoning-Planning Searcher), a novel agentic method powered by multi-modal large language models (MLLMs) that mimics human three-tier cognition. Specifically, PRPSearcher constructs three specialized maps: an object-centric dynamic semantic map enhancing spatial perception, a 3D cognitive map based on semantic attraction values for target reasoning, and a 3D uncertainty map for balanced exploration-exploitation search. Also, our approach incorporates a denoising mechanism to mitigate interference from similar objects and utilizes an Inspiration Promote Thought (IPT) prompting mechanism for adaptive action planning. Experimental results on CityAVOS demonstrate that PRPSearcher surpasses existing baselines in both success rate and search efficiency (on average: +37.69% SR, +28.96% SPL, -30.69% MSS, and -46.40% NE). While promising, the performance gap compared to humans highlights the need for better semantic reasoning and spatial exploration capabilities in AVOS tasks. This work establishes a foundation for future advances in embodied target search. Dataset and source code are available at https://anonymous.4open.science/r/CityAVOS-3DF8.

摘要

城市环境中的空中视觉目标搜索(AVOS)任务要求无人机(UAV)在没有外部引导的情况下,利用视觉和文本线索自主搜索并识别目标物体。现有方法因冗余语义处理、相似物体区分以及探索-利用困境等问题,在复杂城市环境中表现不佳。为填补这一空白并支持AVOS任务,我们提出了CityAVOS——首个用于城市常见物体自主搜索的基准数据集。该数据集包含6个物体类别的2,420项任务,涵盖不同难度级别,可全面评估无人机代理的搜索能力。

为解决AVOS任务,我们还提出PRPSearcher(感知-推理-规划搜索器),这是一种由多模态大语言模型(MLLM)驱动的新型代理方法,模拟人类三层认知机制。具体而言,PRPSearcher构建了三种专用地图:以物体为中心的动态语义地图(增强空间感知)、基于语义吸引力的3D认知地图(用于目标推理)以及3D不确定性地图(实现探索-利用平衡搜索)。此外,该方法通过去噪机制降低相似物体的干扰,并采用"灵感促进思维"(IPT)提示机制实现自适应行动规划。

在CityAVOS上的实验结果表明,PRPSearcher在成功率与搜索效率上均超越现有基线(平均提升:+37.69% SR,+28.96% SPL,-30.69% MSS,-46.40% NE)。尽管表现良好,但与人类性能的差距仍凸显出AVOS任务需要更强的语义推理和空间探索能力。本研究为具身目标搜索的未来发展奠定了基础。数据集与源代码详见:https://anonymous.4open.science/r/CityAVOS-3DF8。
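
The SR and SPL figures quoted above are standard embodied-search metrics. As a minimal sketch, assuming the common definition of SPL (success weighted by path length, where each successful episode contributes the ratio of the shortest path to the path actually flown), they could be computed as follows. The function names and the episode data are hypothetical, and CityAVOS may define its metrics differently in detail.

```python
def success_rate(successes):
    """Fraction of episodes in which the target was found (SR)."""
    return sum(successes) / len(successes)


def spl(successes, shortest_lengths, path_lengths):
    """Success weighted by Path Length, assuming positive path lengths.

    Each episode contributes s * l / max(p, l), where s is 1 on success,
    l is the shortest possible path and p is the path actually taken.
    """
    total = 0.0
    for s, l, p in zip(successes, shortest_lengths, path_lengths):
        total += s * l / max(p, l)
    return total / len(successes)


# Hypothetical episodes: one inefficient success, one failure, one optimal success.
episode_success = [1, 0, 1]
shortest = [10.0, 8.0, 5.0]
flown = [20.0, 8.0, 5.0]

sr_value = success_rate(episode_success)
spl_value = spl(episode_success, shortest, flown)
print(f"SR={sr_value:.3f} SPL={spl_value:.3f}")
```

SPL is never higher than SR, because a detour lowers an episode's contribution and a failure contributes nothing.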


BLAB: Brutally Long Audio Bench

Abstract

arXiv:2505.03054v2 Announce Type: replace Abstract: Developing large audio language models (LMs) capable of understanding diverse spoken interactions is essential for accommodating the multimodal nature of human communication and can increase the accessibility of language technologies across different user populations. Recent work on audio LMs has primarily evaluated their performance on short audio segments, typically under 30 seconds, with limited exploration of long-form conversational speech segments that more closely reflect natural user interactions with these models. We introduce Brutally Long Audio Bench (BLAB), a challenging long-form audio benchmark that evaluates audio LMs on localization, duration estimation, emotion, and counting tasks using audio segments averaging 51 minutes in length. BLAB consists of 833+ hours of diverse, full-length audio clips, each paired with human-annotated, text-based natural language questions and answers. Our audio data were collected from permissively licensed sources and underwent a human-assisted filtering process to ensure task compliance. We evaluate six open-source and proprietary audio LMs on BLAB and find that all of them, including advanced models such as Gemini 2.0 Pro and GPT-4o, struggle with the tasks in BLAB. Our comprehensive analysis reveals key insights into the trade-offs between task difficulty and audio duration. In general, we find that audio LMs struggle with long-form speech, with performance declining as duration increases. They perform poorly on localization, temporal reasoning, and counting, and they struggle to understand non-phonemic information, relying more on prompts than on audio content. BLAB serves as a challenging evaluation framework for developing audio LMs with robust long-form audio understanding capabilities.

摘要

开发能够理解多样化语音交互的大型音频语言模型(LMs)对于适应人类沟通的多模态特性至关重要,并能提升语言技术在不同用户群体中的可及性。当前音频语言模型的研究主要评估其在短音频片段(通常不超过30秒)上的表现,而对更贴近用户自然交互的长篇对话语音段落的探索仍显不足。我们提出"超长音频基准测试"(BLAB),这是一个具有挑战性的长篇音频评估体系,通过平均时长达51分钟的音频片段,对音频语言模型在定位、时长估计、情感识别和计数等任务上的表现进行测试。BLAB包含833小时以上的多样化完整音频片段,每个片段均配有人工标注的文本问答对。所有音频数据均来自许可协议允许的源素材,并经过人工辅助筛选以确保任务合规性。我们对六款开源和商业音频语言模型在BLAB上的表现进行评估,发现包括Gemini 2.0 Pro和GPT-4o在内的先进模型均难以胜任BLAB的任务。综合分析揭示了任务难度与音频时长之间的关键权衡关系:总体而言,音频语言模型在处理长篇语音时表现欠佳,其性能随时长增加而下降;在定位、时序推理、计数等任务上表现较差,且更依赖提示信息而非音频内容本身。BLAB为开发具有强大长篇音频理解能力的音频语言模型提供了具有挑战性的评估框架。


From Calculation to Adjudication: Examining LLM judges on Mathematical Reasoning Tasks

Abstract

arXiv:2409.04168v2 Announce Type: replace-cross Abstract: To reduce the need for human annotations, large language models (LLMs) have been proposed as judges of the quality of other candidate models. The performance of LLM judges is typically evaluated by measuring the correlation with human judgments on generative tasks such as summarization or machine translation. In contrast, we study LLM judges on mathematical reasoning tasks. These tasks require multi-step reasoning, and the correctness of their solutions is verifiable, enabling a more objective evaluation. We perform a detailed performance analysis and find that easy samples are easy to judge, and difficult samples are difficult to judge. Our analysis uncovers a strong correlation between judgment performance and the candidate model task performance, indicating that judges tend to favor higher-quality models even if their answer is incorrect. As a consequence, we test whether we can predict the behavior of LLM judges using simple features such as part-of-speech tags and find that we can correctly predict 70%-75% of judgments. We conclude this study by analyzing practical use cases, showing that LLM judges consistently detect the on-average better model but largely fail if we use them to improve task performance.

摘要

为减少对人类标注的依赖,大型语言模型(LLMs)被提议作为评估其他候选模型质量的裁判。通常通过测量LLM裁判在摘要生成或机器翻译等生成性任务中与人类评判的相关性来评估其性能。与此不同,我们研究了LLM裁判在数学推理任务中的表现。这些任务需要多步推理,且其解决方案的正确性可验证,从而能够进行更客观的评估。我们进行了详细的性能分析,发现简单样本易于评判,而困难样本难以评判。分析揭示了裁判性能与候选模型任务性能之间的强相关性,表明裁判倾向于青睐质量更高的模型,即使其答案错误。基于此,我们测试了是否可以使用词性标注等简单特征预测LLM裁判的行为,发现能正确预测70%-75%的评判结果。最后通过分析实际用例得出结论:LLM裁判能持续检测出平均更优的模型,但若将其用于提升任务性能则大多失效。


UAV-VLA: Vision-Language-Action System for Large Scale Aerial Mission Generation

Abstract

arXiv:2501.05014v2 Announce Type: replace-cross Abstract: The UAV-VLA (Visual-Language-Action) system is a tool designed to facilitate communication with aerial robots. By integrating satellite imagery processing with a Visual Language Model (VLM) and the powerful capabilities of GPT, UAV-VLA enables users to generate general flight path-and-action plans through simple text requests. This system leverages the rich contextual information provided by satellite images, allowing for enhanced decision-making and mission planning. The combination of visual analysis by the VLM and natural language processing by GPT can provide the user with the path-and-action set, making aerial operations more efficient and accessible. The newly developed method showed a 22% difference in the length of the created trajectory and a mean error of 34.22 m in finding the objects of interest on a map, measured by Euclidean distance in the K-Nearest Neighbors (KNN) approach.

摘要

无人机视觉-语言-动作(UAV-VLA)系统是一种旨在简化与空中机器人通信的工具。该系统通过将卫星图像处理与视觉语言模型(VLM)及GPT的强大功能相结合,使用户能够通过简单的文本请求生成通用飞行路径与动作方案。该系统充分利用卫星图像提供的丰富上下文信息,从而提升决策制定与任务规划能力。VLM的视觉分析与GPT的自然语言处理相结合,可为用户提供路径-动作集合,使空中操作更加高效便捷。新开发的方法显示,在K最近邻(KNN)算法中,所生成轨迹的长度差异为22%,而通过欧氏距离计算的地图上目标物体定位平均误差为34.22米。
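
The abstract does not say how the KNN step is applied when measuring localization error. A minimal sketch, assuming k = 1 and planar map coordinates in meters, would match each predicted object position to its nearest reference position and average the Euclidean distances. The function name and the coordinates are hypothetical.

```python
import math


def mean_nearest_neighbor_error(predicted, reference):
    """Mean Euclidean distance from each predicted point to its nearest reference point."""
    errors = [min(math.dist(p, r) for r in reference) for p in predicted]
    return sum(errors) / len(errors)


# Hypothetical map coordinates in meters.
predicted_points = [(0.0, 0.0), (10.0, 10.0)]
reference_points = [(3.0, 4.0), (10.0, 13.0)]

mean_error = mean_nearest_neighbor_error(predicted_points, reference_points)
print(f"mean localization error: {mean_error:.2f} m")
```

Nearest-neighbor matching lets several predictions share one reference point. A one-to-one assignment would be stricter, and it is unclear from the abstract which the authors used.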


HLV-1K: A Large-scale Hour-Long Video Benchmark for Time-Specific Long Video Understanding

Abstract

arXiv:2501.01645v3 Announce Type: replace-cross Abstract: Multimodal large language models have become a popular topic in deep visual understanding due to many promising real-world applications. However, hour-long video understanding, spanning over one hour and containing tens of thousands of visual frames, remains under-explored because of 1) challenging long-term video analyses, 2) inefficient large-model approaches, and 3) lack of large-scale benchmark datasets. Among these, this paper focuses on building a large-scale hour-long video benchmark, HLV-1K, designed to evaluate long video understanding models. HLV-1K comprises 1009 hour-long videos with 14,847 high-quality question answering (QA) and multi-choice question answering (MCQA) pairs with time-aware queries and diverse annotations, covering frame-level, within-event-level, cross-event-level, and long-term reasoning tasks. We evaluate our benchmark using existing state-of-the-art methods and demonstrate its value for testing deep long video understanding capabilities at different levels and for various tasks. This includes promoting future long video understanding tasks at a granular level, such as deep understanding of long live videos, meeting recordings, and movies.

摘要

多模态大语言模型因其众多前景广阔的实际应用,已成为深度视觉理解领域的热门研究方向。然而,时长超过一小时、包含数万视觉帧的长时间视频理解仍存在研究空白,这主要源于三大挑战:1) 长期视频分析的复杂性;2) 大模型方法的低效性;3) 大规模基准数据集的缺失。本文聚焦于构建首个大规模小时级长视频基准HLV-1K,用于评估长视频理解模型。该数据集包含1009条小时级视频,配有14,847个高质量问答对(QA)和多项选择题(MCQA),所有查询均具有时间感知特性且标注类型多样,涵盖帧级、事件内级、跨事件级和长期推理任务。通过现有最先进方法的基准测试,我们验证了该数据集在多层次、多任务深度长视频理解能力评估方面的价值,包括促进未来对长直播视频、会议录像和电影等细粒度长视频理解任务的发展。


CursorCore: Assist Programming through Aligning Anything

Abstract

arXiv:2410.07002v3 Announce Type: replace-cross Abstract: Large language models have been successfully applied to programming assistance tasks, such as code completion, code insertion, and instructional code editing. However, these applications remain insufficiently automated and struggle to effectively integrate various types of information during the programming process, including coding history, current code, and user instructions. In this work, we propose a new conversational framework that comprehensively integrates these information sources, collect data to train our models, and evaluate their performance. Firstly, to thoroughly evaluate how well models align with different types of information and the quality of their outputs, we introduce a new benchmark, APEval (Assist Programming Eval), to comprehensively assess the performance of models in programming assistance tasks. Then, for data collection, we develop a data generation pipeline, Programming-Instruct, which synthesizes training data from diverse sources, such as GitHub and online judge platforms. This pipeline can automatically generate various types of messages throughout the programming process. Finally, using this pipeline, we generate 219K samples, fine-tune multiple models, and develop the CursorCore series. We show that CursorCore outperforms other models of comparable size. This framework unifies applications such as inline chat and automated editing, contributing to the advancement of coding assistants. Code, models and data are freely available at https://github.com/TechxGenus/CursorCore.

摘要

大型语言模型已成功应用于编程辅助任务,如代码补全、代码插入和指令式代码编辑。然而,这些应用的自动化程度仍显不足,且难以有效整合编程过程中的各类信息,包括编码历史、当前代码和用户指令。本研究提出了一种新型会话框架,全面整合这些信息源,通过数据收集训练模型并评估其性能。首先,为系统评估模型与不同类型信息的对齐程度及其输出质量,我们引入新基准APEval(辅助编程评估),全面衡量模型在编程辅助任务中的表现。其次,在数据收集方面,我们开发了Programming-Instruct数据生成管道,通过整合GitHub和在线评测平台等多源数据合成训练数据。该管道可自动生成编程全流程中的各类消息。最终,利用该管道生成219K样本,对多个模型进行微调,开发出CursorCore系列。实验表明CursorCore在同等规模模型中表现优异。该框架统一了行内聊天与自动化编辑等应用,推动了编程助手的发展。代码、模型及数据已开源:https://github.com/TechxGenus/CursorCore。


Exploring Generative AI Techniques in Government: A Case Study

Abstract

arXiv:2504.10497v2 Announce Type: replace-cross Abstract: The swift progress of Generative Artificial Intelligence (GenAI), notably Large Language Models (LLMs), is reshaping the digital landscape. Recognizing this transformative potential, the National Research Council of Canada (NRC) launched a pilot initiative to explore the integration of GenAI techniques into its daily operations for performance excellence, with 22 projects launched in May 2024. Within these projects, this paper presents the development of the intelligent agent Pubbie as a case study, targeting the automation of performance measurement, data management and insight reporting at the NRC. Cutting-edge techniques are explored, including LLM orchestration and semantic embedding via RoBERTa, while strategic fine-tuning and few-shot learning approaches are incorporated to infuse domain knowledge at an affordable cost. The user-friendly interface of Pubbie allows general government users to input queries in natural language and easily upload or download files with a simple button click, greatly reducing manual effort and accessibility barriers.


The Impact of Large Language Models on Open-source Innovation: Evidence from GitHub Copilot

Abstract

arXiv:2409.08379v2 Announce Type: replace-cross Abstract: Large Language Models (LLMs) have been shown to enhance individual productivity in guided settings. Whereas LLMs are likely to also transform innovation processes in a collaborative work setting, it is unclear what trajectory this transformation will follow. Innovation in these contexts encompasses both capability innovation that explores new possibilities by acquiring new competencies in a project and iterative innovation that exploits existing foundations by enhancing established competencies and improving project quality. Whether LLMs affect these two aspects of collaborative work and to what extent is an open empirical question. Open-source development provides an ideal setting to examine LLM impacts on these innovation types, as its voluntary and open/collaborative nature of contributions provides the greatest opportunity for technological augmentation. We focus on open-source projects on GitHub by leveraging a natural experiment around the selective rollout of GitHub Copilot (a programming-focused LLM) in October 2021, where GitHub Copilot selectively supported programming languages like Python or Rust, but not R or Haskell. We observe a significant jump in overall contributions, suggesting that LLMs effectively augment collaborative innovation in an unguided setting. Interestingly, Copilot's launch increased iterative innovation focused on maintenance-related or feature-refining contributions significantly more than it did capability innovation through code-development or feature-introducing commits. This disparity was more pronounced after the model upgrade in June 2022 and was evident in active projects with extensive coding activity, suggesting that as both LLM capabilities and/or available contextual information improve, the gap between capability and iterative innovation may widen. We discuss practical and policy implications to incentivize high-value innovative solutions.

摘要

大型语言模型(LLMs)已被证实在指导性环境中能提升个体生产力。尽管LLMs同样可能改变协作工作环境中的创新流程,但这一转变的具体路径尚不明确。此类创新既包括通过项目中新能力的获取来探索新可能性的能力创新,也涵盖通过增强现有基础、完善既定能力与提升项目质量来实现的迭代创新。LLMs是否以及多大程度影响协作工作中这两类创新,仍是一个待验证的实证问题。开源开发为考察LLMs对这两类创新的影响提供了理想场景,因其自愿开放/协作的贡献特性为技术增强提供了最大可能。我们以GitHub开源项目为研究对象,利用2021年10月GitHub Copilot(一款聚焦编程的LLM)选择性推出的自然实验——该工具支持Python或Rust等语言但不支持R或Haskell。观测数据显示总体贡献量显著跃升,表明LLMs在非指导性环境中有效增强了协作创新。值得注意的是,Copilot的发布使聚焦维护类或功能优化的迭代创新增长幅度显著高于通过代码开发或功能引入提交实现的能力创新。这一差异在2022年6月模型升级后更为显著,且在编码活动频繁的活跃项目中表现突出,暗示随着LLM能力和/或可用上下文信息的提升,能力创新与迭代创新间的差距可能进一步扩大。最后我们探讨了激励高价值创新方案的实践与政策启示。


UAV-VLRR: Vision-Language Informed NMPC for Rapid Response in UAV Search and Rescue

Abstract

arXiv:2503.02465v2 Announce Type: replace-cross Abstract: Emergency search and rescue (SAR) operations often require rapid and precise target identification in complex environments where traditional manual drone control is inefficient. To address these scenarios, a rapid SAR system, UAV-VLRR (Vision-Language-Rapid-Response), is developed in this research. This system consists of two parts: 1) A multimodal system which harnesses the power of a Visual Language Model (VLM) and the natural language processing capabilities of ChatGPT-4o (LLM) for scene interpretation. 2) A nonlinear model predictive controller (NMPC) with built-in obstacle avoidance, which lets the drone respond rapidly and fly according to the output of the multimodal system. This work aims at improving response times in emergency SAR operations by giving the operator a more intuitive and natural way to plan the SAR mission while allowing the drone to carry out that mission in a rapid and safe manner. When tested, our approach was faster on average by 33.75% compared with an off-the-shelf autopilot and by 54.6% compared with a human pilot. Video of UAV-VLRR: https://youtu.be/KJqQGKKt1xY

摘要

紧急搜救(SAR)任务通常需要在复杂环境中快速精确地识别目标,而传统的手动无人机操控效率低下。为解决这一问题,本研究开发了一套快速SAR系统UAV-VLRR(视觉-语言-快速响应)。该系统包含两个核心模块:1)基于视觉语言模型(VLM)和ChatGPT-4o大语言模型(LLM)的多模态场景解析系统;2)内置避障功能的非线性模型预测控制(NMPC),使无人机能根据多模态系统的输出快速响应飞行。该研究旨在通过为操作员提供更直观自然的SAR任务规划方式,同时确保无人机快速安全执行任务,从而提升紧急搜救的响应速度。测试表明,与商用自动驾驶仪相比,本方案平均提速33.75%;与人工操控相比,平均提速54.6%。UAV-VLRR演示视频:https://youtu.be/KJqQGKKt1xY


SMI: An Information-Theoretic Metric for Predicting Model Knowledge Solely from Pre-Training Signals

Abstract

arXiv:2502.04066v2 Announce Type: replace-cross Abstract: The GPT-4 technical report highlights the possibility of predicting model performance on downstream tasks using only pre-training signals, though detailed methodologies are absent. Such predictive capabilities are essential for resource-efficient pre-training and the construction of task-aligned datasets. In this paper, we aim to predict performance in closed-book question answering (QA), a vital downstream task indicative of a model's internal knowledge. We address three primary challenges: (1) limited access to and understanding of pre-training corpora, (2) limitations of current evaluation methods for pre-trained models, and (3) limitations of frequency-based metrics in predicting model performance. In response to these challenges, we conduct large-scale retrieval and semantic analysis across the pre-training corpora of 21 publicly available and 3 custom-trained large language models. Subsequently, we develop a multi-template QA evaluation framework incorporating paraphrased question variants. Building on these foundations, we propose Size-dependent Mutual Information (SMI), an information-theoretic metric that linearly correlates pre-training data characteristics, model size, and QA accuracy, without requiring any additional training. The experimental results demonstrate that SMI outperforms co-occurrence-based baselines, achieving $R^2 > 0.75$ on models with over one billion parameters. Theoretical analysis further reveals the marginal benefits of scaling model size and optimizing data, indicating that the upper limit of specific QA task accuracy is approximately 80%. Our project is available at https://github.com/yuhui1038/SMI.

摘要

GPT-4技术报告指出,仅通过预训练信号即可预测模型在下游任务中的性能,但未提供具体方法。这种预测能力对于资源高效的预训练和任务对齐数据集的构建至关重要。本文旨在预测闭卷问答(QA)这一关键下游任务的性能,该任务能反映模型的内部知识水平。我们解决了三个主要挑战:(1)预训练语料库的访问和理解受限;(2)当前预训练模型评估方法的局限性;(3)基于频率的指标在预测模型性能方面的不足。针对这些挑战,我们对21个公开可用和3个自定义训练的大语言模型的预训练语料库进行了大规模检索和语义分析。随后,开发了一个包含转述问题变体的多模板QA评估框架。在此基础上,提出了规模依赖互信息(SMI),这是一种信息理论指标,无需额外训练即可线性关联预训练数据特征、模型规模和QA准确率。实验结果表明,SMI优于基于共现的基线方法,在参数超过十亿的模型上实现了$R^2 > 0.75$。理论分析进一步揭示了扩展模型规模和优化数据的边际效益,表明特定QA任务准确率的上限约为80%。项目地址:https://github.com/yuhui1038/SMI。


Cite Before You Speak: Enhancing Context-Response Grounding in E-commerce Conversational LLM-Agents

Abstract

arXiv:2503.04830v3 Announce Type: replace-cross Abstract: With the advancement of conversational large language models (LLMs), several LLM-based Conversational Shopping Agents (CSA) have been developed to help customers smooth their online shopping. The primary objective in building an engaging and trustworthy CSA is to ensure the agent's responses about product factoids are accurate and factually grounded. However, two challenges remain. First, LLMs produce hallucinated or unsupported claims. Such inaccuracies risk spreading misinformation and diminishing customer trust. Second, without knowledge source attribution in CSA responses, customers struggle to verify LLM-generated information. To address both challenges, we present an easily productionized solution that enables a "citation experience" for our customers. We build auto-evaluation metrics to holistically evaluate LLMs' grounding and attribution capabilities, and the results suggest that the citation generation paradigm substantially improves grounding performance by 13.83%. To deploy this capability at scale, we introduce the Multi-UX-Inference system, which appends source citations to LLM outputs while preserving existing user experience features and supporting scalable inference. Large-scale online A/B tests show that grounded CSA responses improve customer engagement by 3% - 10%, depending on UX variations.

摘要

随着对话式大语言模型(LLM)的发展,基于LLM的对话购物助手(CSA)已被开发用于提升用户在线购物体验。构建引人入胜且可信赖的CSA核心在于确保助手对产品事实的回应准确且基于实证。然而仍存在两大挑战:其一,LLM会产生虚构或无依据的表述,此类错误可能传播虚假信息并削弱用户信任;其二,若CSA回复未提供知识来源标注,用户难以验证LLM生成信息的真实性。针对这些问题,我们提出一种易于生产部署的解决方案,为用户提供"引用体验"。通过建立自动评估指标全面衡量LLM的实证基础与归因能力,结果表明引用生成范式使实证性能提升13.83%。为实现规模化部署,我们开发了多用户体验推理系统(Multi-UX-Inference),在保留现有用户体验功能的同时为LLM输出附加来源引用,并支持可扩展推理。大规模在线A/B测试显示,基于实证的CSA回复可使客户参与度提升3%-10%,具体效果因用户体验设计差异而异。


CURIE: Evaluating LLMs On Multitask Scientific Long Context Understanding and Reasoning

Abstract

arXiv:2503.13517v2 Announce Type: replace-cross Abstract: Scientific problem-solving involves synthesizing information while applying expert knowledge. We introduce CURIE, a scientific long-Context Understanding, Reasoning, and Information Extraction benchmark to measure the potential of Large Language Models (LLMs) in scientific problem-solving and assisting scientists in realistic workflows. This benchmark introduces ten challenging tasks with a total of 580 problem and solution pairs curated by experts in six disciplines - materials science, condensed matter physics, quantum computing, geospatial analysis, biodiversity, and proteins - covering both experimental and theoretical workflows in science. We evaluate a range of closed and open LLMs on tasks in CURIE that require domain expertise, comprehension of long in-context information, and multi-step reasoning. While Gemini Flash 2.0 and Claude-3 show consistently high comprehension across domains, the popular GPT-4o and command-R+ fail dramatically on protein sequencing tasks. With the best performance at 32%, there is much room for improvement for all models. We hope that insights gained from CURIE can guide the future development of LLMs in the sciences. Evaluation code and data are available at https://github.com/google/curie

摘要

科学问题解决涉及信息综合与专家知识应用。我们推出CURIE(科学长上下文理解、推理与信息提取基准),用于评估大语言模型(LLMs)在科学问题解决及辅助现实科研工作流程中的潜力。该基准包含由六个学科(材料科学、凝聚态物理、量子计算、地理空间分析、生物多样性与蛋白质)专家精心设计的十项挑战性任务,共计580个问题-解决方案对,涵盖科学领域的实验与理论工作流程。我们在需要领域专业知识、长上下文信息理解及多步推理的CURIE任务上评估了多种闭源与开源LLMs。虽然Gemini Flash 2.0和Claude-3展现出跨领域的高理解力,但流行的GPT-4o和command-R+在蛋白质测序任务中表现严重不足。当前最佳模型准确率仅为32%,所有模型均有显著提升空间。我们期望CURIE的研究成果能为科学领域LLMs的未来发展提供指引。评估代码与数据详见https://github.com/google/curie。


AI Hiring with LLMs: A Context-Aware and Explainable Multi-Agent Framework for Resume Screening

Abstract

arXiv:2504.02870v2 Announce Type: replace-cross Abstract: Resume screening is a critical yet time-intensive process in talent acquisition, requiring recruiters to analyze a vast volume of job applications while remaining objective, accurate, and fair. With the advancements in Large Language Models (LLMs), their reasoning capabilities and extensive knowledge bases open new opportunities to streamline and automate recruitment workflows. In this work, we propose a multi-agent framework for resume screening using LLMs to systematically process and evaluate resumes. The framework consists of four core agents: a resume extractor, an evaluator, a summarizer, and a score formatter. To enhance the contextual relevance of candidate assessments, we integrate Retrieval-Augmented Generation (RAG) within the resume evaluator, allowing incorporation of external knowledge sources, such as industry-specific expertise, professional certifications, university rankings, and company-specific hiring criteria. This dynamic adaptation enables personalized recruitment, bridging the gap between AI automation and talent acquisition. We assess the effectiveness of our approach by comparing AI-generated scores with ratings provided by HR professionals on a dataset of anonymized online resumes. The findings highlight the potential of multi-agent RAG-LLM systems in automating resume screening, enabling more efficient and scalable hiring workflows.

摘要

简历筛选是人才招聘中关键但耗时的环节,要求招聘人员在处理大量求职申请时保持客观、准确和公正。随着大语言模型(LLMs)的发展,其推理能力和庞大知识库为简化和自动化招聘流程提供了新机遇。本研究提出一个基于LLMs的多智能体简历筛选框架,通过系统化处理与评估实现流程自动化。该框架包含四个核心智能体:简历提取器、评估器、摘要生成器和分数格式化器。为增强候选人评估的上下文相关性,我们在简历评估器中集成检索增强生成技术(RAG),可动态融入外部知识源,包括行业专业知识、职业资格证书、大学排名和企业特定招聘标准。这种动态适配机制实现了个性化招聘,弥合了AI自动化与人才获取之间的鸿沟。通过对比AI生成分数与人力资源专家对匿名在线简历数据集的评分,我们验证了该方法的有效性。研究结果凸显了多智能体RAG-LLM系统在自动化简历筛选中的潜力,有助于构建更高效、可扩展的招聘工作流程。


Beyond Single-Turn: A Survey on Multi-Turn Interactions with Large Language Models

Abstract

arXiv:2504.04717v3 Announce Type: replace-cross Abstract: Recent advancements in large language models (LLMs) have revolutionized their ability to handle single-turn tasks, yet real-world applications demand sophisticated multi-turn interactions. This survey provides a comprehensive review of recent advancements in evaluating and enhancing multi-turn interactions in LLMs. Focusing on task-specific scenarios, from instruction following in diverse domains such as math and coding to complex conversational engagements in roleplay, healthcare, education, and even adversarial jailbreak settings, we systematically examine the challenges of maintaining context, coherence, fairness, and responsiveness over prolonged dialogues. The paper organizes current benchmarks and datasets into coherent categories that reflect the evolving landscape of multi-turn dialogue evaluation. In addition, we review a range of enhancement methodologies under multi-turn settings, including model-centric strategies (contextual learning, supervised fine-tuning, reinforcement learning, and new architectures), external integration approaches (memory-augmented, retrieval-based methods, and knowledge graph), and agent-based techniques for collaborative interactions. Finally, we discuss open challenges and propose future directions for research to further advance the robustness and effectiveness of multi-turn interactions in LLMs. Related resources and papers are available at https://github.com/yubol-cmu/Awesome-Multi-Turn-LLMs.

摘要

近年来,大型语言模型(LLMs)在单轮任务处理能力上取得突破性进展,然而实际应用场景往往需要复杂的多轮交互。本综述系统回顾了LLMs多轮交互评估与增强技术的最新进展。聚焦任务导向场景——从数学、编程等跨领域指令跟随,到角色扮演、医疗健康、教育等复杂会话场景,乃至对抗性越狱环境,我们深入剖析了长对话中保持上下文连贯性、公平性及响应能力的技术挑战。本文对现有评测基准与数据集进行体系化分类,反映了多轮对话评估领域的发展动态。在增强方法层面,我们综述了多轮场景下的三类技术路径:以模型为核心的方法(上下文学习、监督微调、强化学习及新型架构设计)、外部知识整合策略(记忆增强、检索增强及知识图谱应用),以及面向协同交互的智能体技术。最后,我们探讨了当前研究存在的开放性问题,并提出未来研究方向以提升LLMs多轮交互的鲁棒性与有效性。相关资源与文献详见https://github.com/yubol-cmu/Awesome-Multi-Turn-LLMs。


LLMs meet Federated Learning for Scalable and Secure IoT Management

Abstract

arXiv:2504.16032v2 Announce Type: replace-cross Abstract: The rapid expansion of IoT ecosystems introduces severe challenges in scalability, security, and real-time decision-making. Traditional centralized architectures struggle with latency, privacy concerns, and excessive resource consumption, making them unsuitable for modern large-scale IoT deployments. This paper presents a novel Federated Learning-driven Large Language Model (FL-LLM) framework, designed to enhance IoT system intelligence while ensuring data privacy and computational efficiency. The framework integrates Generative IoT (GIoT) models with a Gradient Sensing Federated Strategy (GSFS), dynamically optimizing model updates based on real-time network conditions. By leveraging a hybrid edge-cloud processing architecture, our approach balances intelligence, scalability, and security in distributed IoT environments. Evaluations on the IoT-23 dataset demonstrate that our framework improves model accuracy, reduces response latency, and enhances energy efficiency, outperforming traditional FL techniques (i.e., FedAvg, FedOpt). These findings highlight the potential of integrating LLM-powered federated learning into large-scale IoT ecosystems, paving the way for more secure, scalable, and adaptive IoT management solutions.

摘要

物联网生态系统的快速扩张对可扩展性、安全性和实时决策提出了严峻挑战。传统集中式架构存在延迟、隐私问题和资源消耗过大等缺陷,难以适应现代大规模物联网部署需求。本文提出一种新型联邦学习驱动的大语言模型(FL-LLM)框架,旨在提升物联网系统智能性的同时确保数据隐私和计算效率。该框架将生成式物联网(GIoT)模型与梯度感知联邦策略(GSFS)相结合,根据实时网络条件动态优化模型更新。通过采用混合边缘-云处理架构,我们的方法在分布式物联网环境中实现了智能性、可扩展性和安全性的平衡。基于IoT-23数据集的评估表明,该框架在模型精度、响应延迟和能效方面均优于传统联邦学习技术(如FedAvg、FedOpt)。这些发现凸显了将大语言模型驱动的联邦学习整合到大规模物联网生态系统中的潜力,为开发更安全、可扩展和自适应的物联网管理解决方案开辟了新途径。
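
The abstract compares the proposed framework against FedAvg, the standard federated learning baseline. As a minimal sketch of that baseline's aggregation step only (not the proposed GSFS strategy, whose details the abstract does not give), the server averages client parameters weighted by each client's number of local samples. The flat parameter lists and sample counts below are hypothetical.

```python
def fedavg(client_weights, client_sizes):
    """Sample-size-weighted average of client parameter vectors (FedAvg aggregation)."""
    total = sum(client_sizes)
    dim = len(client_weights[0])
    return [
        sum(w[i] * n for w, n in zip(client_weights, client_sizes)) / total
        for i in range(dim)
    ]


# Hypothetical round: two IoT clients holding 1 and 3 local samples respectively.
client_params = [[1.0, 2.0], [3.0, 4.0]]
client_samples = [1, 3]

global_params = fedavg(client_params, client_samples)
print(global_params)
```

Weighting by sample count means a client with more data pulls the global model further toward its local update.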


Multi-Modal Language Models as Text-to-Image Model Evaluators

Abstract

arXiv:2505.00759v2 Announce Type: replace-cross Abstract: The steady improvements of text-to-image (T2I) generative models lead to slow deprecation of automatic evaluation benchmarks that rely on static datasets, motivating researchers to seek alternative ways to evaluate the T2I progress. In this paper, we explore the potential of multi-modal large language models (MLLMs) as evaluator agents that interact with a T2I model, with the objective of assessing prompt-generation consistency and image aesthetics. We present Multimodal Text-to-Image Eval (MT2IE), an evaluation framework that iteratively generates prompts for evaluation, scores generated images and matches T2I evaluation of existing benchmarks with a fraction of the prompts used in existing static benchmarks. Moreover, we show that MT2IE's prompt-generation consistency scores have higher correlation with human judgment than scores previously introduced in the literature. MT2IE generates prompts that are efficient at probing T2I model performance, producing the same relative T2I model rankings as existing benchmarks while using only 1/80th the number of prompts for evaluation.

摘要

文本到图像(T2I)生成模型的持续改进导致依赖静态数据集的自动评估基准逐渐过时,这促使研究者寻求替代方法来评估T2I技术的进展。本文探讨了多模态大语言模型(MLLMs)作为评估代理的潜力,这些代理通过与T2I模型交互,旨在评估提示生成一致性和图像美学。我们提出了多模态文本到图像评估框架(MT2IE),该框架迭代生成评估提示,对生成的图像进行评分,并以现有静态基准所用提示的一小部分数量匹配T2I模型的评估结果。此外,我们发现MT2IE的提示生成一致性评分与人类判断的相关性高于文献中先前提出的评分。MT2IE生成的提示能高效探测T2I模型性能,仅使用现有基准1/80的提示数量即可产生相同的T2I模型相对排名。


Gradual Binary Search and Dimension Expansion : A general method for activation quantization in LLMs

Abstract

arXiv:2504.13989v2 Announce Type: replace-cross Abstract: Large language models (LLMs) have become pivotal in artificial intelligence, demonstrating strong capabilities in reasoning, understanding, and generating data. However, their deployment on edge devices is hindered by their substantial size, often reaching several billion parameters. Quantization is a widely used method to reduce memory usage and inference time; however, LLMs present unique challenges due to the prevalence of outliers in their activations. In this work, we leverage the theoretical advantages of Hadamard matrices over random rotation matrices to push the boundaries of quantization in LLMs. We demonstrate that Hadamard matrices are more effective in reducing outliers, which are a significant obstacle in achieving low-bit quantization. Our method, based on a gradual binary search, enables 3-bit quantization for weights, activations, and key-value (KV) caches, resulting in a 40% increase in accuracy on common benchmarks compared to SoTA methods. We extend the use of rotation matrices to support non-power-of-2 embedding dimensions, such as those of the Qwen architecture, by employing the Paley algorithm. We theoretically demonstrate the superiority of Hadamard matrices in reducing outliers. We achieve 3-bit quantization for weights, activations, and KV cache, significantly enhancing model performance. Our experimental results on multiple model families, such as Mistral, LLaMA, and Qwen, demonstrate the effectiveness of our approach, outperforming existing methods and enabling practical 3-bit quantization.

Summary

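Large language models (LLMs) have become central to artificial intelligence, showing strong capabilities in reasoning, understanding, and data generation. Their size, often several billion parameters, hinders deployment on edge devices. Quantization is a common way to reduce memory usage and inference time, but the outliers prevalent in LLM activations make it uniquely challenging. This work exploits the theoretical advantages of Hadamard matrices over random rotation matrices to push the limits of LLM quantization. We show that Hadamard matrices are more effective at reducing outliers, a major obstacle to low-bit quantization, and support this with a theoretical analysis. A method based on gradual binary search enables 3-bit quantization of weights, activations, and key-value (KV) caches, yielding a 40% accuracy increase on common benchmarks over state-of-the-art methods. Using the Paley algorithm, we extend rotation matrices to non-power-of-2 embedding dimensions, as found in the Qwen architecture. Experiments on several model families, including Mistral, LLaMA, and Qwen, show that the approach outperforms existing methods and makes 3-bit quantization practical.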
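
The abstract does not spell out how the Paley algorithm is applied, so the following is only a sketch of the classical Paley type I construction, which yields a Hadamard matrix of order q + 1 for a prime q with q mod 4 = 3. Combining it with a Sylvester factor through a Kronecker product, as done at the end, is one assumed way to reach orders of the form (q + 1) * 2^k; the function names are hypothetical.

```python
def paley_hadamard(q):
    """Paley type I Hadamard matrix of order q + 1, for a prime q with q % 4 == 3."""
    if q % 4 != 3 or any(q % d == 0 for d in range(2, int(q ** 0.5) + 1)):
        raise ValueError("q must be a prime with q % 4 == 3")
    residues = {(k * k) % q for k in range(1, q)}

    def chi(a):
        # Quadratic character (Legendre symbol) modulo q.
        a %= q
        if a == 0:
            return 0
        return 1 if a in residues else -1

    n = q + 1
    h = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j or i == 0:
                h[i][j] = 1            # diagonal and first row
            elif j == 0:
                h[i][j] = -1           # first column below the diagonal
            else:
                h[i][j] = chi((i - 1) - (j - 1))  # Jacobsthal block
    return h

def kron(a, b):
    """Kronecker product of two matrices given as lists of rows."""
    return [[x * y for x in ra for y in rb] for ra in a for rb in b]

def is_hadamard(h):
    """Check that entries are +1/-1 and that distinct rows are orthogonal."""
    n = len(h)
    if any(len(row) != n or any(v not in (1, -1) for v in row) for row in h):
        return False
    return all(
        sum(a * b for a, b in zip(h[i], h[j])) == (n if i == j else 0)
        for i in range(n)
        for j in range(n)
    )

h12 = paley_hadamard(11)             # order 12, not a power of 2
h24 = kron([[1, 1], [1, -1]], h12)   # order 24 via a Sylvester factor
print(len(h12), is_hadamard(h12))
print(len(h24), is_hadamard(h24))
```

Because the Kronecker product of two Hadamard matrices is again a Hadamard matrix, a small Paley block can be paired with a power-of-2 Sylvester block to cover dimensions that Sylvester matrices alone cannot.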


Efficient Shapley Value-based Non-Uniform Pruning of Large Language Models

Abstract

arXiv:2505.01731v2 Announce Type: replace-cross Abstract: Pruning large language models (LLMs) is a promising solution for reducing model sizes and computational complexity while preserving performance. Traditional layer-wise pruning methods often adopt a uniform sparsity approach across all layers, which leads to suboptimal performance because it ignores the varying significance of individual transformer layers within the model. To this end, we propose the Shapley Value-based Non-Uniform Pruning (SV-NUP) method for LLMs. This approach quantifies the contribution of each transformer layer to the overall model performance, enabling the assignment of tailored pruning budgets to different layers to retain critical parameters. To further improve efficiency, we design the Sliding Window-based Shapley Value approximation method. It substantially reduces computational overhead compared to exact SV calculation methods. Extensive experiments on various LLMs including LLaMA-v1, LLaMA-v2 and OPT demonstrate the effectiveness of the proposed approach. The results reveal that non-uniform pruning significantly enhances the performance of pruned models. Notably, SV-NUP achieves a reduction in perplexity (PPL) of 18.01% and 19.55% on LLaMA-7B and LLaMA-13B, respectively, compared to SparseGPT at 70% sparsity.
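
The abstract names a sliding window-based Shapley value approximation without giving its details, so the sketch below rests on an assumed interpretation: for each layer, Shapley coalitions are enumerated only over a window of neighboring layers, while every layer outside the window is kept present. The value function `toy_value`, its weights, and all function names are made up for illustration; in practice the value would come from evaluating the model with the chosen layers intact.

```python
from itertools import combinations
from math import factorial

def shapley_of(i, players, value, fixed=frozenset()):
    """Exact Shapley value of player i among `players`; `fixed` is always present."""
    others = [p for p in players if p != i]
    n = len(others) + 1
    total = 0.0
    for k in range(n):
        # Probability weight of a coalition of size k preceding player i.
        weight = factorial(k) * factorial(n - k - 1) / factorial(n)
        for subset in combinations(others, k):
            s = fixed | frozenset(subset)
            total += weight * (value(s | {i}) - value(s))
    return total

def exact_shapley(num_layers, value):
    """Exact values; cost grows as 2^(num_layers - 1) evaluations per layer."""
    return [shapley_of(i, range(num_layers), value) for i in range(num_layers)]

def windowed_shapley(num_layers, value, radius):
    """Assumed sliding-window variant: enumerate coalitions only inside the window."""
    phi = []
    for i in range(num_layers):
        window = range(max(0, i - radius), min(num_layers, i + radius + 1))
        fixed = frozenset(range(num_layers)) - frozenset(window)
        phi.append(shapley_of(i, window, value, fixed))
    return phi

# Hypothetical per-layer weights plus a small bonus for each adjacent pair kept.
WEIGHTS = [0.5, 0.1, 0.3, 0.2, 0.4, 0.1]

def toy_value(layers):
    base = sum(WEIGHTS[l] for l in layers)
    bonus = 0.05 * sum(1 for l in layers if l + 1 in layers)
    return base + bonus

exact = exact_shapley(len(WEIGHTS), toy_value)
approx = windowed_shapley(len(WEIGHTS), toy_value, radius=1)
for layer, (e, a) in enumerate(zip(exact, approx)):
    print(f"layer {layer}: exact={e:.4f} windowed={a:.4f}")
```

In this toy game the only interactions are between adjacent layers, so a window of radius 1 recovers the exact values while evaluating far fewer coalitions. For a real model, where distant layers also interact, the windowed result would only be an approximation.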

Summary

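Pruning large language models (LLMs) is a promising way to reduce model size and computational complexity while preserving performance. Traditional layer-wise pruning methods usually apply the same sparsity to every layer, which hurts performance because it ignores differences in the importance of individual transformer layers. To address this, we propose Shapley Value-based Non-Uniform Pruning (SV-NUP). The method quantifies each transformer layer's contribution to overall model performance and assigns each layer a tailored pruning budget so that critical parameters are retained. To improve efficiency further, we design a sliding window-based Shapley value approximation that substantially reduces computational overhead compared with exact Shapley value calculation. Experiments on several LLMs, including LLaMA-v1, LLaMA-v2, and OPT, confirm the method's effectiveness and show that non-uniform pruning significantly improves the performance of pruned models. Notably, at 70% sparsity, SV-NUP reduces perplexity (PPL) by 18.01% on LLaMA-7B and 19.55% on LLaMA-13B relative to SparseGPT.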
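
How layer importance scores become per-layer pruning budgets is not specified in the summary above. The sketch below uses one simple assumed rule: shift each layer's sparsity away from the global target in proportion to its centered, range-normalized score, so that more important layers are pruned less and the average sparsity stays at the target. The scores, the `delta` parameter, and the function name are hypothetical, and the layers are assumed to have equal parameter counts.

```python
def allocate_sparsity(scores, target, delta):
    """Map importance scores to per-layer sparsity with mean equal to `target`.

    scores: one importance score per layer (higher means more important).
    target: desired average sparsity across layers, between 0 and 1.
    delta:  largest allowed deviation of any layer from the target.
    """
    n = len(scores)
    mean = sum(scores) / n
    span = max(scores) - min(scores)
    if span == 0:
        return [target] * n  # all layers equally important: uniform pruning
    # Centered scores sum to zero, so the average sparsity remains `target`.
    sparsity = [target - delta * (s - mean) / span for s in scores]
    if min(sparsity) < 0.0 or max(sparsity) > 1.0:
        raise ValueError("delta is too large for this target sparsity")
    return sparsity

# Made-up importance scores for six layers.
layer_scores = [0.525, 0.2, 0.4, 0.3, 0.5, 0.125]
budgets = allocate_sparsity(layer_scores, target=0.7, delta=0.2)

for layer, (score, s) in enumerate(zip(layer_scores, budgets)):
    print(f"layer {layer}: score={score:.3f} sparsity={s:.3f}")
print(f"mean sparsity: {sum(budgets) / len(budgets):.3f}")
```

With unequal layer sizes the mean would need to be weighted by parameter count to hold the overall sparsity at the target.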